AI Model VRAM Requirements (2026) — Exact GPU Memory for 182+ LLMs, Flux, SDXL & Video
Will your GPU run it? Exact VRAM for 182+ AI models (Qwen 3.5, Gemma 4, Llama 4, DeepSeek R1, Flux, SDXL, video) at Q4/Q8/FP16, with 8GB/12GB/24GB/48GB GPU picks.
If you are searching "how much VRAM does X need?", this is the master table. It covers the most common local AI queries, from Qwen 3.5 9B and Gemma 3 12B to Mistral Small 24B, DeepSeek R1, Flux, SDXL, and video models.
VRAM is the single biggest bottleneck for running AI models locally. You can have the fastest GPU on the market, but if your model does not fit in its memory, you will either get an out-of-memory error or suffer a massive performance penalty from CPU offloading. This guide gives approximate VRAM numbers for major models at common quantization levels, so you can make informed hardware decisions before downloading multi-gigabyte files.
Fast reference
- Qwen 3.5 9B: ~5.1 GB at Q4_K_M
- Gemma 3 12B: ~6.7 GB at Q4_K_M
- Mistral Small 24B: ~13.4 GB at Q4_K_M
- DeepSeek R1 32B: ~18.4 GB at Q4_K_M
- Flux and SDXL: see the dedicated image-model rows below
How VRAM Requirements Are Calculated
The memory a model needs comes from three sources: the model weights themselves, the KV (key-value) cache that stores attention state during inference, and the runtime overhead for the framework and CUDA context.
The core formula is:
VRAM ≈ (parameter_count × bytes_per_weight) + KV_cache + overhead
Bytes Per Weight by Quantization Level
Quantization reduces the number of bits used to store each weight. The trade-off is a small loss in quality for a large reduction in memory:
| Format | Nominal bits per weight | Effective bytes per param (incl. scales) | Quality loss |
|---|---|---|---|
| F32 | 32 | 4.0 | None (reference) |
| F16 / BF16 | 16 | 2.0 | Negligible |
| Q8_0 | 8 | 1.06 | ~0.1% |
| Q6_K | 6 | 0.81 | ~0.5% |
| Q5_K_M | 5 | 0.69 | ~1% |
| Q4_K_M | 4 | 0.56 | ~2-3% |
| Q3_K_M | 3 | 0.44 | ~5-8% |
| Q2_K | 2 | 0.31 | Noticeable |
The _K suffix means "k-quants" — a smarter quantization scheme from llama.cpp that applies different bit depths to different layers, preserving quality better than naive quantization at the same average bit width.
KV Cache Overhead
The KV cache stores intermediate attention computations for each token in your context window. Its size scales with context length, number of attention layers, and model dimension:
KV_cache ≈ 2 × num_layers × d_head × num_kv_heads × context_length × 2 bytes
For a typical 8B model (32 layers, 128 d_head, 8 KV heads) at 4K context, this works out to roughly 512 MB (2 × 32 × 128 × 8 × 4096 × 2 bytes). At 32K context it becomes 4 GB, which is significant but still manageable. For very long contexts (128K+), the KV cache can exceed the model weights themselves.
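The KV cache formula above can be sketched as a small Python function. The function and argument names are illustrative and not taken from any runtime; the default of 2 bytes per element assumes an F16 cache, as in the formula.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, context_length,
                   bytes_per_element=2.0):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer, per KV head, per token.
    """
    per_token = 2 * num_layers * num_kv_heads * d_head * bytes_per_element
    return per_token * context_length


def to_mib(num_bytes):
    """Convert bytes to MiB (1024 * 1024 bytes)."""
    return num_bytes / (1024 ** 2)


# 8B-class configuration quoted in the text: 32 layers, 8 KV heads, d_head 128.
for ctx in (4096, 32768, 131072):
    size = kv_cache_bytes(32, 8, 128, ctx)
    print(f"{ctx:>7} tokens: {to_mib(size):,.0f} MiB")
```

Because the result is linear in every argument, doubling the context length or the layer count doubles the cache.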
Runtime Overhead
Every inference runtime (llama.cpp, Ollama, LM Studio, vLLM) needs memory for:
- CUDA/Metal context initialization: ~200-400MB
- Activations and temporary buffers: ~200-500MB
- Graph compilation cache: ~100-200MB
Budget 500MB to 1GB of overhead on top of model weights and KV cache.
Worked Example: Llama 3.1 8B at Q4_K_M
Model weights: 8B params × 0.56 bytes/param = 4.48 GB
KV cache (4K): ~512 MB
Overhead: ~512 MB
───────────────────────────────
Total: ~5.5 GB
In practice, llama.cpp reports around 4.7-5.0 GB for this configuration, so the formula gives a conservative upper bound. The slight variation comes from how different layers are quantized within the Q4_K_M scheme.
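The core formula can be turned into a rough estimator. This is a minimal sketch: the bytes-per-parameter values are copied from the quantization table above, `estimate_vram_gb` is a hypothetical helper name, and the KV cache and overhead figures are left to the caller because they depend on context length and runtime.

```python
# Bytes per parameter, copied from the quantization table in this guide.
BYTES_PER_PARAM = {
    "F32": 4.0, "F16": 2.0, "Q8_0": 1.06, "Q6_K": 0.81,
    "Q5_K_M": 0.69, "Q4_K_M": 0.56, "Q3_K_M": 0.44, "Q2_K": 0.31,
}


def estimate_vram_gb(params_billion, quant, kv_cache_gb, overhead_gb):
    """VRAM ~= weights + KV cache + overhead, all in decimal GB."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    return weights_gb + kv_cache_gb + overhead_gb


# Weights-only footprint of an 8B model at several quantization levels.
for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(quant, round(estimate_vram_gb(8, quant, 0.0, 0.0), 2), "GB")

# Full estimate; 0.5 GB each for KV cache and overhead are illustrative
# round figures, not measured values.
print(round(estimate_vram_gb(8, "Q4_K_M", kv_cache_gb=0.5, overhead_gb=0.5), 2), "GB")
```

Treat the output as a planning number. The figure your runtime reports after loading is the one that counts.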
Complete VRAM Requirements Table
All values in the table represent approximate model weight size in VRAM. Add 1-2 GB for KV cache and runtime overhead at typical context lengths (4K-8K).
| Model | Params | Type | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | 3.2B | Dense | 1.8 GB | 2.2 GB | 2.6 GB | 3.4 GB | 6.4 GB |
| Phi-4-mini | 3.8B | Dense | 2.1 GB | 2.6 GB | 3.1 GB | 4.0 GB | 7.6 GB |
| Gemma 3 4B | 4B | Dense | 2.2 GB | 2.8 GB | 3.2 GB | 4.2 GB | 8.0 GB |
| Mistral 7B | 7.2B | Dense | 4.0 GB | 5.0 GB | 5.8 GB | 7.6 GB | 14.4 GB |
| Llama 3.1 8B | 8B | Dense | 4.5 GB | 5.5 GB | 6.5 GB | 8.5 GB | 16 GB |
| Qwen 3 8B | 8.2B | Dense | 4.6 GB | 5.7 GB | 6.6 GB | 8.7 GB | 16.4 GB |
| Gemma 3 12B | 12B | Dense | 6.7 GB | 8.3 GB | 9.7 GB | 12.7 GB | 24 GB |
| Phi-4 14B | 14B | Dense | 7.8 GB | 9.7 GB | 11.3 GB | 14.8 GB | 28 GB |
| Qwen 3 14B | 14.8B | Dense | 8.3 GB | 10.2 GB | 12.0 GB | 15.7 GB | 29.6 GB |
| Mistral Small 24B | 24B | Dense | 13.4 GB | 16.6 GB | 19.4 GB | 25.4 GB | 48 GB |
| Qwen 3 30B-A3B | 30B/3B | MoE | 16.8 GB | 20.7 GB | 24.3 GB | 31.8 GB | 60 GB |
| DeepSeek R1 32B | 32.8B | Dense | 18.4 GB | 22.6 GB | 26.6 GB | 34.8 GB | 65.6 GB |
| Llama 3.3 70B | 70.6B | Dense | 39.5 GB | 48.7 GB | 57.2 GB | 74.8 GB | 141 GB |
| Qwen 3 235B-A22B | 235B/22B | MoE | 131.6 GB | 162 GB | 190 GB | 249 GB | 470 GB |
| DeepSeek R1 671B | 671B/37B | MoE | 375.8 GB | 463 GB | 543 GB | 711 GB | 1342 GB |
Note: MoE models are listed with total/active parameter counts.
Key takeaways from the table
- Q4_K_M is the sweet spot for most users. It cuts memory to 28% of F16 with only 2-3% quality loss.
- Every 1B parameters needs ~0.56 GB at Q4 — a rough mental model that works surprisingly well.
- MoE models break the simple rule for speed, not for memory. See the next section for why Qwen 3 30B-A3B still needs ~17 GB at Q4_K_M but computes like a 3B model.
- DeepSeek R1 671B is not a consumer model at any quantization. Even Q2_K (~200GB) requires a multi-GPU server cluster.
MoE Models: Why Active Parameters Matter
Mixture of Experts (MoE) models have a fundamentally different architecture than dense models. Instead of activating all parameters for every token, they route each token through a small subset of specialized "expert" networks. The rest of the parameters are loaded into memory but sit idle.
The result: you load all the weights, but only compute a fraction of them.
Take Qwen 3 30B-A3B: it has 30 billion total parameters, but only activates 3 billion for any given token. This means:
- VRAM loading: you still need ~17 GB at Q4_K_M to hold all the weights
- Compute cost: it runs at the speed of a 3B model, not a 30B model
- Quality: closer to a 14-20B dense model because the routing selects the most relevant experts
Compare this to a dense 30B model, which would need the same ~17 GB at Q4_K_M but compute at 30B model speed. The MoE model holds 30B parameters' worth of knowledge at roughly 3B inference cost, which is the core appeal.
DeepSeek R1 671B takes this further: 671B total parameters, only 37B active. Each forward pass touches only a fraction of the weights, which is why it delivers frontier-level reasoning while being theoretically runnable on large-memory Mac systems. The ~376 GB Q4 requirement still applies, because all weights have to be loaded.
Practical implication: When you see a MoE model listed as "30B/3B", the first number is what you need for VRAM planning, and the second is what determines inference speed.
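The split between memory cost and compute cost can be sketched in a few lines. `moe_profile` is a hypothetical helper, and the 0.56 bytes-per-parameter default is the Q4_K_M figure used throughout this guide.

```python
def moe_profile(total_params_b, active_params_b, bytes_per_param=0.56):
    """Separate the memory cost (total params) from the compute cost (active params)."""
    return {
        # All experts must be resident, so memory follows the total count.
        "weights_gb": total_params_b * bytes_per_param,
        # Per-token compute follows the active count.
        "active_fraction": active_params_b / total_params_b,
    }


for name, total, active in [("Qwen 3 30B-A3B", 30, 3),
                            ("Qwen 3 235B-A22B", 235, 22),
                            ("DeepSeek R1 671B", 671, 37)]:
    p = moe_profile(total, active)
    print(f"{name}: {p['weights_gb']:.1f} GB at Q4_K_M, "
          f"{p['active_fraction']:.1%} of weights used per token")
```

For a dense model, `active_params_b` equals `total_params_b`, so the active fraction is 1.0.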
How Context Length Affects VRAM
Context length has a direct and often underestimated impact on VRAM. The longer the context window you use, the more VRAM the KV cache consumes.
Here is the KV cache overhead for Llama 3.1 8B (F16 KV cache, typical settings) at different context lengths:
| Context Length | KV Cache | Total VRAM at Q4_K_M |
|---|---|---|
| 2K tokens | ~256 MB | ~5.0 GB |
| 4K tokens | ~512 MB | ~5.3 GB |
| 8K tokens | ~1.0 GB | ~5.8 GB |
| 16K tokens | ~2.0 GB | ~6.8 GB |
| 32K tokens | ~4.0 GB | ~8.8 GB |
| 128K tokens | ~16 GB | ~21 GB |
The jump from 4K to 128K context adds about 15.5 GB of KV cache, enough to push a model that fits on an 8 GB GPU to requiring a 24 GB card.
KV cache quantization is available in newer runtimes (llama.cpp supports Q8 and Q4 KV cache). Enabling Q8 KV cache roughly halves the overhead, making long-context inference practical on tighter VRAM budgets.
For a larger 70B model such as Llama 3.3 70B (80 layers, 8 KV heads, 128 d_head), the F16 KV cache at 32K context is approximately 10 GB, which is significant enough to require planning when you are already near your GPU's memory limit.
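To see how context length and KV cache quantization interact, the sketch below sweeps both. It assumes llama.cpp's block layouts for `q8_0` and `q4_0` (32 values plus one 16-bit scale per block, giving 8.5 and 4.5 bits per element), which is why Q8 is slightly more than half of F16. The function name is hypothetical.

```python
# Effective bits per cached element. q8_0 and q4_0 store 32 values plus a
# 16-bit scale per block, hence 8.5 and 4.5 bits rather than 8 and 4.
KV_BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(num_layers, num_kv_heads, d_head, context_length,
                 cache_type="f16"):
    """KV cache size in GiB for a given cache element type."""
    elements = 2 * num_layers * num_kv_heads * d_head * context_length
    return elements * KV_BITS_PER_ELEMENT[cache_type] / 8 / 1024 ** 3


# 8B-class model (32 layers) versus 70B-class model (80 layers),
# both with 8 KV heads and d_head 128.
for layers in (32, 80):
    for ctx in (8192, 32768, 131072):
        row = ", ".join(
            f"{t}={kv_cache_gib(layers, 8, 128, ctx, t):.2f} GiB"
            for t in KV_BITS_PER_ELEMENT
        )
        print(f"{layers} layers @ {ctx:>6} tokens: {row}")
```

Actual savings depend on whether your runtime quantizes both the K and V caches.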
Practical advice: Configure your context length to what you actually need. If you are doing conversational chat, 4K-8K is usually sufficient. Reserve 32K+ contexts for document analysis and long-form generation tasks where the extra VRAM cost is justified.
GPU VRAM Quick Reference
Knowing your GPU's VRAM is step one. Here is a current reference of popular GPUs and what they can realistically run:
| GPU | VRAM | Best Model Fit |
|---|---|---|
| RTX 4060 8GB | 8 GB | 7-8B models at Q4 |
| RTX 4070 12GB | 12 GB | 8B at Q6+, 14B at Q4 |
| RTX 4070 Ti Super | 16 GB | 14B at Q5+, 24B at Q4 |
| RTX 4080 Super | 16 GB | 14B at Q5+, 24B at Q4 |
| RTX 4090 | 24 GB | 30B MoE, 32B at Q4 |
| RTX 5090 | 32 GB | 70B at Q3 with offload |
| Mac M4 Pro 24GB | ~17 GB usable | Similar to a 16 GB GPU: 14B at Q5+, 24B at Q4 |
| Mac M4 Max 64GB | ~46 GB usable | 70B at Q4 |
| Mac M4 Ultra 192GB | ~140 GB usable | 70B at Q8, 235B MoE at Q4 (tight) |
A few notes on this table:
Apple Silicon unified memory is shared between CPU and GPU. macOS reserves roughly 25% for the OS and apps. A 64 GB Mac M4 Max practically has around 46-48 GB available for model weights. The benefit is that CPU offloading on Mac is far less painful than on discrete GPUs — the unified memory bus keeps offloaded layers much faster than PCIe-connected system RAM.
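Dual GPU setups (e.g., two RTX 4090s with layer splitting or tensor parallelism) pool their VRAM. The RTX 4090 has no NVLink connector, so the cards communicate over PCIe. Two 4090s give you 48 GB, enough for a 70B model at Q4_K_M (39.5 GB) with room for context; Q5_K_M (48.7 GB) does not fit. Not all runtimes support multi-GPU efficiently. vLLM (tensor parallelism) and llama.cpp (layer or row splitting) are the best options here.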
PCIe bandwidth matters for offloading. When a model partially overflows VRAM, layers are offloaded to system RAM. On PCIe 4.0 x16, the theoretical bandwidth is around 32 GB/s. In practice, you will see 15-20 GB/s. This means heavy offloading (more than 30-40% of layers) will significantly degrade throughput.
Practical Recommendations by VRAM Tier
8 GB VRAM (RTX 4060, RTX 3070, etc.)
You are in the most common consumer tier. Stick to models in the 3B-8B range:
- Best options: Llama 3.2 3B at Q6_K (2.6 GB), Phi-4-mini at Q4_K_M (2.1 GB), Mistral 7B at Q4_K_M (4.0 GB), Llama 3.1 8B at Q4_K_M (4.5 GB)
- Possible with offloading: 14B models at Q4, accepting 2-4x speed reduction
- Avoid: models above ~14B, which need heavy offloading or very low-bit quantization (Q2/Q3) with significant quality degradation
12 GB VRAM (RTX 4070, RTX 3080 12GB)
You have meaningful headroom. The RTX 4070 12GB is a strong value GPU for local AI:
- Best options: Qwen 3 8B at Q6_K (6.6 GB), Gemma 3 12B at Q4_K_M (6.7 GB), Phi-4 14B at Q4_K_M (7.8 GB)
- Possible with offloading: 24B models at Q4
- Advantage over 8 GB: you can run 14B models fully in VRAM, which are meaningfully more capable than 8B
16 GB VRAM (RTX 4070 Ti Super, RTX 4080 Super)
The 16 GB tier opens up the 14-24B range comfortably:
- Best options: Qwen 3 14B at Q5_K_M (10.2 GB), Mistral Small 24B at Q4_K_M (13.4 GB)
- Stretch goal: DeepSeek R1 32B at Q4_K_M (18.4 GB) with modest offloading
- Sweet spot workload: Code generation and reasoning tasks where 14-24B models outperform smaller alternatives significantly
24 GB VRAM (RTX 4090, RTX 3090 Ti)
The RTX 4090 is the de facto standard for serious local AI work:
- Best options: Mistral Small 24B at Q5_K_M (16.6 GB), Qwen 3 30B-A3B at Q4_K_M (16.8 GB), DeepSeek R1 32B at Q4_K_M (18.4 GB)
- Llama 3.3 70B: feasible with 50-60% offloading; expect ~3-5 tokens/sec
- The 4090 gap: there is a meaningful jump between 24 GB and the next tier — the 4090 cannot fit a 70B model cleanly, which is its main limitation
32+ GB VRAM (RTX 5090, professional GPUs)
The RTX 5090's 32 GB closes the gap significantly:
- Llama 3.3 70B at Q3_K_M: ~31 GB of weights (70.6B × 0.44), so it needs light offloading once KV cache and overhead are added
- DeepSeek R1 32B at Q6_K: 26.6 GB, fits cleanly
- Full 70B at Q4: requires 39 GB, still needs offloading
48-64 GB (Mac M4 Max, dual 4090, server GPUs)
This tier handles the 70B class properly:
- Mac M4 Max 64GB: ~46 GB usable — runs Llama 3.3 70B at Q5_K_M (48.7 GB) with light offloading; Q4_K_M (39.5 GB) fits cleanly
- NVIDIA A100 40GB / H100 80GB: professional GPUs designed for this workload
- Dual RTX 4090: 48 GB total with tensor parallelism enabled
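The tier recommendations above follow a simple rule: pick the highest-precision quantization whose weights plus headroom fit your VRAM. A minimal sketch of that rule, using weight sizes copied from the table in this guide. `best_quant_that_fits` is a hypothetical helper, and the 1.5 GB default headroom is an assumed midpoint of the 1-2 GB guidance.

```python
# Weight sizes in GB, copied from the VRAM requirements table (subset).
WEIGHTS_GB = {
    "Llama 3.1 8B":      {"Q4_K_M": 4.5,  "Q5_K_M": 5.5,  "Q6_K": 6.5,  "Q8_0": 8.5,  "F16": 16.0},
    "Phi-4 14B":         {"Q4_K_M": 7.8,  "Q5_K_M": 9.7,  "Q6_K": 11.3, "Q8_0": 14.8, "F16": 28.0},
    "Mistral Small 24B": {"Q4_K_M": 13.4, "Q5_K_M": 16.6, "Q6_K": 19.4, "Q8_0": 25.4, "F16": 48.0},
    "DeepSeek R1 32B":   {"Q4_K_M": 18.4, "Q5_K_M": 22.6, "Q6_K": 26.6, "Q8_0": 34.8, "F16": 65.6},
    "Llama 3.3 70B":     {"Q4_K_M": 39.5, "Q5_K_M": 48.7, "Q6_K": 57.2, "Q8_0": 74.8, "F16": 141.0},
}

# Highest precision first, so the first match is the best quality that fits.
QUANT_ORDER = ["F16", "Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"]


def best_quant_that_fits(model, vram_gb, headroom_gb=1.5):
    """Return the highest-precision quantization that fits fully in VRAM, or None."""
    for quant in QUANT_ORDER:
        if WEIGHTS_GB[model][quant] + headroom_gb <= vram_gb:
            return quant
    return None  # does not fit without offloading


for vram in (8, 12, 16, 24, 32, 48):
    fits = {m: best_quant_that_fits(m, vram) for m in WEIGHTS_GB}
    print(f"{vram} GB:", fits)
```

A result of `None` means the model needs offloading at every listed quantization. Raise `headroom_gb` if you plan to use long contexts.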
Tips for Managing VRAM
Monitor actual usage before committing. Use nvidia-smi on Linux/Windows, or Activity Monitor on Mac, to track live VRAM consumption. Runtimes like Ollama report the estimated model size before loading.
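On NVIDIA systems, the same check can be scripted. The sketch below assumes `nvidia-smi` is on your PATH and uses its CSV query mode; the helper names are illustrative. Parsing is kept in its own function so it can be tried on sample text without a GPU.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' rows (in MiB) from nvidia-smi CSV output, one row per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage. Requires an NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_csv(result.stdout)


# Parsing a sample line in the format nvidia-smi prints with the flags above.
print(parse_memory_csv("1234, 24564\n"))
```

Call `query_gpu_memory()` before and after loading a model to see how much VRAM the runtime actually claimed.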
Leave headroom for context. If your model fits in VRAM with 200 MB to spare, you may still get OOM errors when generating long responses. Budget at least 1-2 GB above the model weight size for runtime and KV cache.
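Consider KV cache quantization. llama.cpp exposes this through --cache-type-k and --cache-type-v (for example q8_0 or q4_0), and Ollama through the OLLAMA_KV_CACHE_TYPE environment variable; both typically need flash attention enabled for the quantized cache to take effect. Halving the KV cache size is often worth the minor quality reduction, especially for long-context tasks.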
CPU offloading is a spectrum, not a binary. Most runtimes let you specify how many layers to keep in VRAM (--n-gpu-layers in llama.cpp). Offloading 5-10 layers of a 70B model (which has 80 total layers) is often barely noticeable in speed. Offloading 50+ layers will be slow.
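A starting value for --n-gpu-layers can be estimated from the weight size and layer count. This sketch assumes every layer is the same size, which is a simplification since embedding and output tensors differ. The helper name and the 1.5 GB reserve for KV cache and overhead are assumptions.

```python
def gpu_layers_that_fit(weights_gb, total_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = weights_gb / total_layers
    budget_gb = vram_gb - reserve_gb
    if budget_gb <= 0:
        return 0
    return min(total_layers, int(budget_gb // per_layer_gb))


# Llama 3.3 70B at Q4_K_M: 39.5 GB of weights across 80 layers.
for vram in (16, 24, 32, 48):
    n = gpu_layers_that_fit(39.5, 80, vram)
    print(f"{vram} GB VRAM: about {n} of 80 layers on GPU "
          f"({(80 - n) / 80:.0%} offloaded)")
```

Start from the estimate, then lower the layer count if you hit out-of-memory errors or raise it if VRAM is left unused.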
Flash attention reduces peak memory. Enable flash attention when your runtime supports it (--flash-attn in llama.cpp). It reduces the peak memory during the attention computation without affecting output quality.
MoE models load fully but run partially. You cannot partially load a MoE model's experts — the entire weight file must fit in memory. However, you can still apply quantization to the loaded weights. Qwen 3 30B-A3B at Q4_K_M is an excellent choice for 24 GB GPUs wanting near-30B quality.
Check Your Hardware
Every number in this guide is an approximation. Real-world VRAM usage depends on your exact runtime version, batch size, system RAM speed, and OS overhead. The best way to know whether a specific model will run well on your hardware is to check it directly.
Use the Will It Run AI calculator to get a personalized fit estimate for your GPU or Mac with any model at any quantization level. It accounts for your exact hardware specs, the model's architecture, and typical runtime overhead — giving you a pass/fail verdict with expected performance tier.
You can also browse all supported models filtered by your VRAM budget, or check out our guides on what LLM you can run locally and the complete GGUF quantization guide for deeper context on choosing the right format for your setup.