How much VRAM does Llama 3 need?

Llama 3.1 8B needs approximately 4.3GB at Q4_K_M, 5.5GB at Q6_K, or 8.5GB at Q8. Llama 3.3 70B needs 39GB at Q4_K_M or 74GB at Q8. The exact amount depends on quantization level and context length.

Can I run AI models with 8GB VRAM?

Yes. With 8GB VRAM you can run models up to about 7-8B parameters at Q4-Q5 quantization, or 3-4B models at Q8. Popular choices include Llama 3.2 3B, Phi-4-mini, and Gemma 3 4B.

How much VRAM does DeepSeek R1 need?

DeepSeek R1 comes in multiple sizes. The 7B distill needs ~4GB at Q4. The 32B distill needs ~18GB at Q4. The full 671B model needs ~370GB at Q4, requiring multiple GPUs or a high-memory Mac.

Does quantization reduce VRAM usage?

Yes, significantly. Going from F16 to Q4_K_M reduces VRAM by about 72%. A 14B model drops from ~28GB (F16) to ~8GB (Q4_K_M). This is why quantization is essential for running large models on consumer hardware.

What happens if my model doesn't fit in VRAM?

If a model exceeds your VRAM, some layers get offloaded to system RAM (CPU offloading). This works but dramatically reduces speed — typically 5-20x slower for offloaded layers. Partial offloading (10-20%) is often acceptable.

How much VRAM for a 70B model?

A 70B model needs approximately 39GB at Q4_K_M, 48GB at Q5_K_M, or 74GB at Q8. The RTX 4090 (24GB) can run it with significant offloading. A Mac M4 Max 64GB or RTX 5090 32GB + offload handles it better.

April 20, 2026vram, gpu, hardware, technical

AI Model VRAM Requirements (2026) — Exact GPU Memory for 182+ LLMs, Flux, SDXL & Video

Will your GPU run it? Exact VRAM for 182+ AI models (Qwen 3.5, Gemma 4, Llama 4, DeepSeek R1, Flux, SDXL, video) at Q4/Q8/FP16, with 8GB/12GB/24GB/48GB GPU picks.

If you are searching "how much VRAM does X need?", this is the master table. It covers the most common local AI queries, from Qwen 3.5 9B and Gemma 3 12B to Mistral Small 24B, DeepSeek R1, Flux, SDXL, and video models.

VRAM is the single biggest bottleneck for running AI models locally. You can have the fastest GPU on the market, but if your model does not fit in its memory, you will either get an out-of-memory error or suffer a massive performance penalty from CPU offloading. This guide gives you exact VRAM numbers for every major model at every quantization level, so you can make informed hardware decisions before downloading multi-gigabyte files.

Fast reference

Qwen 3.5 9B: ~5.1 GB at Q4_K_M
Gemma 3 12B: ~6.7 GB at Q4_K_M
Mistral Small 24B: ~13.4 GB at Q4_K_M
DeepSeek R1 32B: ~18.4 GB at Q4_K_M
Flux and SDXL: see the dedicated image-model rows below

How VRAM Requirements Are Calculated

The memory a model needs comes from three sources: the model weights themselves, the KV (key-value) cache that stores attention state during inference, and the runtime overhead for the framework and CUDA context.

The core formula is:

VRAM ≈ (parameter_count × bytes_per_weight) + KV_cache + overhead

Bytes Per Weight by Quantization Level

Quantization reduces the number of bits used to store each weight. The trade-off is a small loss in quality for a large reduction in memory:

Format	Bits per weight	Bytes per param	Quality loss
F32	32	4.0	None (reference)
F16 / BF16	16	2.0	Negligible
Q8_0	8	1.06	~0.1%
Q6_K	6	0.81	~0.5%
Q5_K_M	5	0.69	~1%
Q4_K_M	4	0.56	~2-3%
Q3_K_M	3	0.44	~5-8%
Q2_K	2	0.31	Noticeable

The _K suffix means "k-quants" — a smarter quantization scheme from llama.cpp that applies different bit depths to different layers, preserving quality better than naive quantization at the same average bit width.

KV Cache Overhead

The KV cache stores intermediate attention computations for each token in your context window. Its size scales with context length, number of attention layers, and model dimension:

KV_cache ≈ 2 × num_layers × d_head × num_kv_heads × context_length × 2 bytes

For a typical 8B model (32 layers, 128 d_head, 8 KV heads) at 4K context, this works out to roughly 256MB. At 32K context it becomes 2GB — significant but still manageable. For very long contexts (128K+), KV cache can exceed the model weights themselves.

Runtime Overhead

Every inference runtime (llama.cpp, Ollama, LM Studio, vLLM) needs memory for:

CUDA/Metal context initialization: ~200-400MB
Activations and temporary buffers: ~200-500MB
Graph compilation cache: ~100-200MB

Budget 500MB to 1GB of overhead on top of model weights and KV cache.

Worked Example: Llama 3.1 8B at Q4_K_M

Model weights:  8B params × 0.56 bytes/param = 4.48 GB
KV cache (4K):  ~256 MB
Overhead:       ~512 MB
───────────────────────────────
Total:          ~5.25 GB

In practice, llama.cpp reports around 4.7-5.0 GB for this configuration — the formula gives a close upper bound. The slight variation comes from how different layers are quantized within the Q4_K_M scheme.

Complete VRAM Requirements Table

This is the definitive reference. All values represent approximate model weight size in VRAM. Add 1-2 GB for KV cache and runtime overhead at typical context lengths (4K-8K).

Model	Params	Type	Q4_K_M	Q5_K_M	Q6_K	Q8_0	F16
Llama 3.2 3B	3.2B	Dense	1.8 GB	2.2 GB	2.6 GB	3.4 GB	6.4 GB
Phi-4-mini	3.8B	Dense	2.1 GB	2.6 GB	3.1 GB	4.0 GB	7.6 GB
Gemma 3 4B	4B	Dense	2.2 GB	2.8 GB	3.2 GB	4.2 GB	8.0 GB
Mistral 7B	7.2B	Dense	4.0 GB	5.0 GB	5.8 GB	7.6 GB	14.4 GB
Llama 3.1 8B	8B	Dense	4.5 GB	5.5 GB	6.5 GB	8.5 GB	16 GB
Qwen 3 8B	8.2B	Dense	4.6 GB	5.7 GB	6.6 GB	8.7 GB	16.4 GB
Gemma 3 12B	12B	Dense	6.7 GB	8.3 GB	9.7 GB	12.7 GB	24 GB
Phi-4 14B	14B	Dense	7.8 GB	9.7 GB	11.3 GB	14.8 GB	28 GB
Qwen 3 14B	14.8B	Dense	8.3 GB	10.2 GB	12.0 GB	15.7 GB	29.6 GB
Mistral Small 24B	24B	Dense	13.4 GB	16.6 GB	19.4 GB	25.4 GB	48 GB
Qwen 3 30B-A3B	30B/3B	MoE	16.8 GB	20.7 GB	24.3 GB	31.8 GB	60 GB
DeepSeek R1 32B	32.8B	Dense	18.4 GB	22.6 GB	26.6 GB	34.8 GB	65.6 GB
Llama 3.3 70B	70.6B	Dense	39.5 GB	48.7 GB	57.2 GB	74.8 GB	141 GB
Qwen 3 235B-A22B	235B/22B	MoE	131.6 GB	162 GB	190 GB	249 GB	470 GB
DeepSeek R1 671B	671B/37B	MoE	375.8 GB	463 GB	543 GB	711 GB	1342 GB

Note: Values are approximate model weight sizes. Add 1-2 GB for KV cache and runtime overhead at 4K-8K context lengths. MoE models listed with total/active parameter counts.

Key takeaways from the table

Q4_K_M is the sweet spot for most users. It cuts memory to 28% of F16 with only 2-3% quality loss.
Every 1B parameters needs ~0.56 GB at Q4 — a rough mental model that works surprisingly well.
MoE models break the simple rule — see the next section for why Qwen 3 30B-A3B needs less than its parameter count suggests.
DeepSeek R1 671B is not a consumer model at any quantization. Even Q2_K (~200GB) requires a multi-GPU server cluster.

MoE Models: Why Active Parameters Matter

Mixture of Experts (MoE) models have a fundamentally different architecture than dense models. Instead of activating all parameters for every token, they route each token through a small subset of specialized "expert" networks. The rest of the parameters are loaded into memory but sit idle.

The result: you load all the weights, but only compute a fraction of them.

Take Qwen 3 30B-A3B: it has 30 billion total parameters, but only activates 3 billion for any given token. This means:

VRAM loading: you still need ~17 GB at Q4_K_M to hold all the weights
Compute cost: it runs at the speed of a 3B model, not a 30B model
Quality: closer to a 14-20B dense model because the routing selects the most relevant experts

Compare this to a dense 30B model, which would need around 17 GB at Q4_K_M and compute at 30B model speed. The MoE model gives you 30B-level knowledge at 3B inference cost — which is the core appeal.

DeepSeek R1 671B takes this further: 671B total parameters, only 37B active. This is why it delivers frontier-level reasoning while being theoretically runnable on large Mac systems — the 375 GB Q4 requirement is for loading all weights, but each forward pass only touches a fraction of them.

Practical implication: When you see a MoE model listed as "30B/3B", the first number is what you need for VRAM planning, and the second is what determines inference speed.

How Context Length Affects VRAM

Context length has a direct and often underestimated impact on VRAM. The longer the context window you use, the more VRAM the KV cache consumes.

Here is the KV cache overhead for Llama 3.1 8B (F16 KV cache, typical settings) at different context lengths:

Context Length	KV Cache	Total VRAM at Q4_K_M
2K tokens	~128 MB	~4.8 GB
4K tokens	~256 MB	~5.0 GB
8K tokens	~512 MB	~5.3 GB
16K tokens	~1.0 GB	~5.8 GB
32K tokens	~2.0 GB	~6.8 GB
128K tokens	~8.0 GB	~13 GB

The jump from 4K to 128K context adds 8 GB of KV cache — enough to push a model from fitting on a 12 GB GPU to requiring 16 GB or more.

KV cache quantization is available in newer runtimes (llama.cpp supports Q8 and Q4 KV cache). Enabling Q8 KV cache roughly halves the overhead, making long-context inference practical on tighter VRAM budgets.

For a larger 70B model (80 layers, larger attention heads), the KV cache at 32K context is approximately 8 GB — significant enough to require planning when you are already near your GPU's memory limit.

Practical advice: Configure your context length to what you actually need. If you are doing conversational chat, 4K-8K is usually sufficient. Reserve 32K+ contexts for document analysis and long-form generation tasks where the extra VRAM cost is justified.

GPU VRAM Quick Reference

Knowing your GPU's VRAM is step one. Here is a current reference of popular GPUs and what they can realistically run:

GPU	VRAM	Best Model Fit
RTX 4060 8GB	8 GB	7-8B models at Q4
RTX 4070 12GB	12 GB	8B at Q6+, 14B at Q4
RTX 4070 Ti Super	16 GB	14B at Q5+, 24B at Q4
RTX 4080 Super	16 GB	14B at Q5+, 24B at Q4
RTX 4090	24 GB	30B MoE, 32B at Q4
RTX 5090	32 GB	70B at Q3 with offload
Mac M4 Pro 24GB	~17 GB usable	Similar to RTX 4090
Mac M4 Max 64GB	~46 GB usable	70B at Q5+
Mac M4 Ultra 192GB	~140 GB usable	70B at F16, 235B MoE at Q4

A few notes on this table:

Apple Silicon unified memory is shared between CPU and GPU. macOS reserves roughly 25% for the OS and apps. A 64 GB Mac M4 Max practically has around 46-48 GB available for model weights. The benefit is that CPU offloading on Mac is far less painful than on discrete GPUs — the unified memory bus keeps offloaded layers much faster than PCIe-connected system RAM.

Dual GPU setups (e.g., two RTX 4090s via NVLink or tensor parallelism) add their VRAM pools. Two 4090s give you 48 GB, enough for a 70B model at Q5. Not all runtimes support multi-GPU efficiently — vLLM and llama.cpp with tensor parallelism are the best options here.

PCIe bandwidth matters for offloading. When a model partially overflows VRAM, layers are offloaded to system RAM. On PCIe 4.0 x16, the theoretical bandwidth is around 32 GB/s. In practice, you will see 15-20 GB/s. This means heavy offloading (more than 30-40% of layers) will significantly degrade throughput.

Practical Recommendations by VRAM Tier

8 GB VRAM (RTX 4060, RTX 3070, etc.)

You are in the most common consumer tier. Stick to models in the 3B-8B range:

Best options: Llama 3.2 3B at Q6_K (2.6 GB), Phi-4-mini at Q4_K_M (2.1 GB), Mistral 7B at Q4_K_M (4.0 GB), Llama 3.1 8B at Q4_K_M (4.5 GB)
Possible with offloading: 14B models at Q4, accepting 2-4x speed reduction
Avoid: anything above 12B without significant quality degradation

12 GB VRAM (RTX 4070, RTX 3080 12GB)

You have meaningful headroom. The RTX 4070 12GB is a strong value GPU for local AI:

Best options: Qwen 3 8B at Q6_K (6.6 GB), Gemma 3 12B at Q4_K_M (6.7 GB), Phi-4 14B at Q4_K_M (7.8 GB)
Possible with offloading: 24B models at Q4
Advantage over 8 GB: you can run 14B models fully in VRAM, which are meaningfully more capable than 8B

16 GB VRAM (RTX 4070 Ti Super, RTX 4080 Super)

The 16 GB tier opens up the 14-24B range comfortably:

Best options: Qwen 3 14B at Q5_K_M (10.2 GB), Mistral Small 24B at Q4_K_M (13.4 GB)
Stretch goal: DeepSeek R1 32B at Q4_K_M (18.4 GB) with modest offloading
Sweet spot workload: Code generation and reasoning tasks where 14-24B models outperform smaller alternatives significantly

24 GB VRAM (RTX 4090, RTX 3090 Ti)

The RTX 4090 is the de facto standard for serious local AI work:

Best options: Mistral Small 24B at Q5_K_M (16.6 GB), Qwen 3 30B-A3B at Q4_K_M (16.8 GB), DeepSeek R1 32B at Q4_K_M (18.4 GB)
Llama 3.3 70B: feasible with 50-60% offloading; expect ~3-5 tokens/sec
The 4090 gap: there is a meaningful jump between 24 GB and the next tier — the 4090 cannot fit a 70B model cleanly, which is its main limitation

32+ GB VRAM (RTX 5090, professional GPUs)

The RTX 5090's 32 GB closes the gap significantly:

Llama 3.3 70B at Q3_K_M: ~30 GB, fits with minimal headroom
DeepSeek R1 32B at Q6_K: 26.6 GB, fits cleanly
Full 70B at Q4: requires 39 GB, still needs offloading

48-64 GB (Mac M4 Max, dual 4090, server GPUs)

This tier handles the 70B class properly:

Mac M4 Max 64GB: ~46 GB usable — runs Llama 3.3 70B at Q5_K_M (48.7 GB) with light offloading; Q4_K_M (39.5 GB) fits cleanly
NVIDIA A100 40GB / H100 80GB: professional GPUs designed for this workload
Dual RTX 4090: 48 GB total with tensor parallelism enabled

Tips for Managing VRAM

Monitor actual usage before committing. Use nvidia-smi on Linux/Windows, or Activity Monitor on Mac, to track live VRAM consumption. Runtimes like Ollama report the estimated model size before loading.

Leave headroom for context. If your model fits in VRAM with 200 MB to spare, you may still get OOM errors when generating long responses. Budget at least 1-2 GB above the model weight size for runtime and KV cache.

Consider KV cache quantization. llama.cpp and Ollama support --kv-cache-type q8_0 and --kv-cache-type q4_0. Halving the KV cache size is often worth the minor quality reduction, especially for long-context tasks.

CPU offloading is a spectrum, not a binary. Most runtimes let you specify how many layers to keep in VRAM (--n-gpu-layers in llama.cpp). Offloading 5-10 layers of a 70B model (which has 80 total layers) is often barely noticeable in speed. Offloading 50+ layers will be slow.

Flash attention reduces peak memory. Enable flash attention when your runtime supports it (--flash-attn in llama.cpp). It reduces the peak memory during the attention computation without affecting output quality.

MoE models load fully but run partially. You cannot partially load a MoE model's experts — the entire weight file must fit in memory. However, you can still apply quantization to the loaded weights. Qwen 3 30B-A3B at Q4_K_M is an excellent choice for 24 GB GPUs wanting near-30B quality.

Check Your Hardware

Every number in this guide is an approximation. Real-world VRAM usage depends on your exact runtime version, batch size, system RAM speed, and OS overhead. The best way to know whether a specific model will run well on your hardware is to check it directly.

Use the Will It Run AI calculator to get a personalized fit estimate for your GPU or Mac with any model at any quantization level. It accounts for your exact hardware specs, the model's architecture, and typical runtime overhead — giving you a pass/fail verdict with expected performance tier.

You can also browse all supported models filtered by your VRAM budget, or check out our guides on what LLM you can run locally and the complete GGUF quantization guide for deeper context on choosing the right format for your setup.