How much VRAM do I need to run a 7B model?

A 7B parameter model needs approximately 4–5 GB VRAM at Q4_K_M quantization, or 7–8 GB at Q8. An RTX 3060 12GB, RTX 4060 8GB, or any GPU with 8+ GB VRAM handles 7B models comfortably.

Can I run a 70B model on a single GPU?

A 70B model at Q4_K_M requires around 39 GB VRAM. No single consumer GPU fits this fully — the RTX 5090 has 32 GB, so it needs some CPU offloading. A Mac M4 Max with 64 GB or M4 Ultra with 128 GB unified memory handles 70B models comfortably.

What is the minimum VRAM for running LLMs?

6–8 GB VRAM is the practical minimum for a useful experience. At 6 GB you can run 3–4B models at Q4. At 8 GB you can run 7–8B models at Q4. Below 6 GB, most models require so much CPU offloading that inference becomes too slow for interactive use.

Does RAM matter as much as VRAM for LLMs?

VRAM is primary — model weights must fit in VRAM for fast inference. System RAM matters for CPU offloading (when a model layer spills out of VRAM). Having 32–64 GB system RAM lets you run oversized models at reduced speed via offloading. Apple Silicon unifies VRAM and RAM, which changes this equation significantly.

How much VRAM do I need for a 13B or 14B model?

A 13–14B model needs approximately 8 GB at Q4_K_M, 10–11 GB at Q5_K_M, or 14–15 GB at Q8. An RTX 4070 12GB runs 14B models at Q4–Q5. A 16 GB GPU like the RTX 4080 handles them at Q8.

What VRAM tier do I need to run Llama 3 70B?

Llama 3.3 70B needs about 39 GB at Q4_K_M or 74 GB at Q8. To run it without offloading, you need an Apple Silicon Mac with 64+ GB unified memory, a professional GPU (A100 80GB, H100), or a multi-GPU setup. On a 24 GB RTX 4090 or 32 GB RTX 5090, it runs with partial CPU offloading at reduced speed.

March 27, 2026vram, llm, hardware, beginners-guide

How Much VRAM Do You Need to Run LLMs Locally? (2026 Guide)

Practical VRAM requirements for running LLMs locally in 2026. From 7B to 405B models, quantization impact, and hardware recommendations for every budget.

VRAM is the central constraint for running large language models locally. It determines which models load, how fast they generate tokens, and whether you even get a response or an out-of-memory error. Before you download a 40 GB model file, you want to know if it fits.

This guide cuts straight to the numbers. No fluff — just practical VRAM requirements for every common model size tier, with quantization impact and hardware recommendations you can act on today.

The Core Rule: Model Size Drives VRAM

The single most useful formula for estimating VRAM is:

VRAM (GB) ≈ Parameters (billions) × bytes per parameter + ~2 GB overhead

The bytes-per-parameter depends on quantization:

Quantization	Bytes/Param	Example: 7B model	Example: 70B model
FP16 (no quantization)	2.00	~14 GB	~140 GB
Q8_0	1.06	~7.4 GB	~74 GB
Q6_K	0.81	~5.7 GB	~57 GB
Q5_K_M	0.69	~4.8 GB	~48 GB
Q4_K_M	0.56	~3.9 GB	~39 GB
Q3_K_M	0.44	~3.1 GB	~31 GB

The overhead (KV cache + runtime) adds roughly 1–2 GB for small models and 2–4 GB for larger ones. Context length also affects KV cache — a very long context window can add several gigabytes.

VRAM Requirements by Model Size Tier

3B–4B Models (e.g., Llama 3.2 3B, Phi-4-mini, Gemma 3 4B)

These are the most accessible models. At Q4_K_M they need only 2–3 GB of VRAM — running comfortably even on 6 GB GPUs.

Quantization	VRAM Required
Q4_K_M	2–3 GB
Q6_K	2.5–3.5 GB
Q8_0	3.5–4.5 GB

Best for: Fast, lightweight tasks. Coding assistance, summarization, quick chat. Runs on any modern GPU.

Hardware fit: RTX 4060 8GB, RTX 3060 12GB, or any GPU with 6+ GB VRAM. Even an integrated graphics setup with shared memory can manage Q4.

7B–8B Models (e.g., Llama 3.1 8B, Qwen 3 8B, Mistral 7B)

The most popular tier for home users. At Q4_K_M they fit comfortably in 8 GB VRAM with room for context.

Quantization	VRAM Required
Q4_K_M	~5 GB
Q5_K_M	~5.5–6 GB
Q6_K	~6.5–7 GB
Q8_0	~8.5–9 GB

Best for: The everyday workhorse. Chat, writing, coding, reasoning. Quality is genuinely useful for most tasks.

Hardware fit: Q4–Q5 runs on RTX 4060 8GB. Q6–Q8 needs 12 GB (RTX 4070 or RTX 3060 12GB). These models are the sweet spot for budget builds. Check compatibility for your GPU →

13B–14B Models (e.g., Qwen 2.5 14B, Phi-4 14B)

A meaningful step up in quality from 8B. Noticeably better reasoning and coding output, at the cost of needing more VRAM.

Quantization	VRAM Required
Q4_K_M	~8–9 GB
Q5_K_M	~10–11 GB
Q6_K	~12–13 GB
Q8_0	~15–16 GB

Best for: Users who want better-than-8B quality without jumping to 30B+ memory requirements. Great coding and instruction-following.

Hardware fit: Q4 on RTX 4060 Ti 16GB or RTX 4070 12GB. Q6–Q8 needs 16 GB (RTX 4080). These are the top-end models for a 12 GB GPU budget.

30B–34B Models (e.g., Qwen 3 30B, DeepSeek R1 32B Distill)

High-capability models. At Q4 they need around 18–20 GB VRAM — pushing past what most consumer GPUs offer.

Quantization	VRAM Required
Q4_K_M	~17–21 GB
Q5_K_M	~21–24 GB
Q6_K	~25–27 GB

Best for: Complex reasoning, professional-level coding, nuanced writing tasks where 14B shows its limits.

Hardware fit: A 24 GB RTX 4090 or RTX 3090 runs these at Q4 with room to spare. Mac M2 Max (32 GB), M3 Max (48 GB), or M4 Max (64 GB) handle them at Q6+ comfortably. Browse 30B models →

MoE Models: Special Case (e.g., Qwen 3 30B-A3B, Mixtral 8x7B)

Mixture-of-Experts (MoE) models look like large models but behave like smaller ones at runtime. Qwen 3 30B-A3B has 30B total parameters but only ~3B active per token.

Critical: You must fit all parameters in VRAM (even inactive ones), but inference speed and quality match a ~3B active model.

Model	Total Params	Active Params	Q4 VRAM
Qwen 3 30B-A3B	30B	~3B	~19 GB
Mixtral 8x7B	47B	~13B	~26 GB
Qwen 3 235B-A22B	235B	~22B	~130 GB

MoE models offer exceptional quality-per-VRAM for inference when they fit. Check Qwen 3 30B-A3B compatibility →

70B Models (e.g., Llama 3.3 70B, Qwen 2.5 72B)

The frontier of single-machine local inference. These are models where results are genuinely close to frontier proprietary models like GPT-4.

Quantization	VRAM Required
Q4_K_M	~39 GB
Q5_K_M	~48 GB
Q8_0	~74 GB

Hardware fit: No consumer GPU fits these without offloading. Options:

RTX 4090 (24 GB) + CPU offload: runs at ~30–50% speed penalty
RTX 5090 (32 GB) + CPU offload: less offloading needed
Mac M4 Max 64 GB: fits Q4 natively, comfortable
Mac M4 Ultra 128 GB: fits Q8 natively

Check if your hardware can run Llama 3.3 70B →

100B–405B Models (e.g., Llama 3.1 405B, DeepSeek R1 671B)

Consumer hardware cannot run these without extreme multi-GPU setups or CPU offloading to the point of impracticality.

Llama 3.1 405B at Q4: ~225 GB VRAM required
DeepSeek R1 671B at Q4: ~372 GB VRAM required

At this scale, you are looking at A100/H100 clusters, or a Mac Studio/Mac Pro with M2 Ultra (192 GB) or M3 Ultra (192 GB). For most users, these models are better accessed via API.

How Quantization Changes the Calculation

Quantization is the single biggest lever for fitting models on limited hardware. The difference between FP16 and Q4_K_M is roughly 4x less VRAM for a moderate quality trade-off.

What you lose with each step down:

FP16 → Q8: Nearly nothing. Q8 is essentially lossless. Saves ~47% VRAM.
Q8 → Q6: Slight degradation, imperceptible in most tasks. Saves another 24% VRAM.
Q6 → Q5: Noticeable on benchmarks, rarely on everyday tasks. Saves ~15%.
Q5 → Q4: Moderate quality loss. Chat holds up well; coding and reasoning degrade. Saves another 19%.
Q4 → Q3: Noticeable degradation. Outputs become less coherent. Only use as a last resort.

The practical rule: use the highest quantization your VRAM allows while keeping the entire model in VRAM. A model at Q4 fully in VRAM will always outperform the same model at Q8 with half its layers offloaded to CPU.

Explore how quantization levels affect your models →

VRAM Tiers and What They Unlock

Here is a practical summary of what each VRAM tier makes possible:

6–8 GB VRAM (RTX 4060, RTX 3070, RX 6700 XT)

3–4B models at Q6–Q8: great quality, fast
7–8B models at Q4: good quality, the everyday workhorse
13B at Q3: possible but degraded — not recommended

Best pick: Llama 3.2 3B or Phi-4-mini at Q6 for speed; Llama 3.1 8B or Qwen 3 8B at Q4 for quality.

12 GB VRAM (RTX 4070, RTX 3060 12GB)

7–8B models at Q8: near-lossless quality
13–14B models at Q4–Q5: comfortable fit
30B MoE (like Qwen 3 30B-A3B) at Q2–Q3: tight, marginal quality

Best pick: Qwen 3 8B at Q8, or Phi-4 14B at Q5_K_M for excellent results.

16–20 GB VRAM (RTX 4060 Ti 16GB, RTX 4080, A4000)

14B models at Q8: essentially lossless
30B models at Q4–Q5: comfortable
70B at Q2–Q3: possible with quality trade-off

Best pick: Qwen 2.5 14B at Q8, or DeepSeek R1 32B Distill at Q4 for strong reasoning.

24 GB VRAM (RTX 4090, RTX 3090, RX 7900 XTX)

30B models at Q6: high quality
70B at Q4 with ~30% CPU offload: acceptable performance
70B fully native: not without offload on single GPU

Best pick: Qwen 3 30B at Q6, or Llama 3.3 70B at Q4 with minimal offloading. See what fits a 24GB GPU →

32 GB VRAM (RTX 5090, RTX 6000 Ada)

70B models at Q4: just barely fits
30B models at Q8: comfortable
70B at Q5: needs slight offload

Best pick: Llama 3.3 70B at Q4 fully in VRAM, or Qwen 3 72B at Q4.

Apple Silicon Unified Memory (M4 Max 64 GB, M4 Ultra 128 GB)

Apple Silicon changes the VRAM equation because the CPU and GPU share the same memory pool. There's no VRAM vs RAM split.

M4 Max 64 GB: 70B at Q4 comfortably; 30B at Q8; 14B at FP16
M4 Ultra 128 GB: 70B at Q8; 100B+ at Q4; 405B with some offloading

Speed is slower than equivalent discrete GPUs, but capacity is unmatched in consumer hardware.

Compare Apple Silicon configs →

Practical Recommendations

Under $300 GPU budget: RTX 4060 8GB. Runs 7–8B models at Q4 well. Great entry point.

$400–600 GPU budget: RTX 4070 12GB. The sweet spot — 14B at Q5, 8B at Q8. Most versatile consumer choice.

$800–1200 GPU budget: RTX 4090 24GB. Maximum consumer VRAM until RTX 5090. Runs 30B at Q6.

For very large models: Mac M4 Max or Mac Studio M4 Ultra. More memory capacity than any consumer GPU at a premium price.

Already have hardware? Check what you can run →

Key Takeaways

VRAM is the primary constraint. Speed matters, but only if the model fits first.
Quantization is not optional for most users — Q4_K_M at your target model size is the starting point, not the compromise.
A smaller model fully in VRAM beats a larger model half-offloaded. Always.
MoE models offer exceptional quality per VRAM — check if there's a MoE option before assuming you can't run large models.
Apple Silicon's unified memory is a genuine advantage for fitting large models, at a speed trade-off.

Use the hardware compatibility calculator to find the best models for your specific GPU — it factors in VRAM, bandwidth, and architecture to tell you what will actually run well, not just what will technically load.

Check your hardware compatibility →