How Much VRAM Do You Need to Run LLMs Locally? (2026 Guide)
Practical VRAM requirements for running LLMs locally in 2026. From 7B to 405B models, quantization impact, and hardware recommendations for every budget.
VRAM is the central constraint for running large language models locally. It determines which models load, how fast they generate tokens, and whether you even get a response or an out-of-memory error. Before you download a 40 GB model file, you want to know if it fits.
This guide cuts straight to the numbers. No fluff — just practical VRAM requirements for every common model size tier, with quantization impact and hardware recommendations you can act on today.
The Core Rule: Model Size Drives VRAM
The single most useful formula for estimating VRAM is:
VRAM (GB) ≈ Parameters (billions) × bytes per parameter + ~2 GB overhead
The bytes-per-parameter depends on quantization:
| Quantization | Bytes/Param | Example: 7B model | Example: 70B model |
|---|---|---|---|
| FP16 (no quantization) | 2.00 | ~14 GB | ~140 GB |
| Q8_0 | 1.06 | ~7.4 GB | ~74 GB |
| Q6_K | 0.81 | ~5.7 GB | ~57 GB |
| Q5_K_M | 0.69 | ~4.8 GB | ~48 GB |
| Q4_K_M | 0.56 | ~3.9 GB | ~39 GB |
| Q3_K_M | 0.44 | ~3.1 GB | ~31 GB |
The overhead (KV cache + runtime) adds roughly 1–2 GB for small models and 2–4 GB for larger ones. Context length also affects KV cache — a very long context window can add several gigabytes.
VRAM Requirements by Model Size Tier
3B–4B Models (e.g., Llama 3.2 3B, Phi-4-mini, Gemma 3 4B)
These are the most accessible models. At Q4_K_M they need only 2–3 GB of VRAM — running comfortably even on 6 GB GPUs.
| Quantization | VRAM Required |
|---|---|
| Q4_K_M | 2–3 GB |
| Q6_K | 2.5–3.5 GB |
| Q8_0 | 3.5–4.5 GB |
Best for: Fast, lightweight tasks. Coding assistance, summarization, quick chat. Runs on any modern GPU.
Hardware fit: RTX 4060 8GB, RTX 3060 12GB, or any GPU with 6+ GB VRAM. Even an integrated graphics setup with shared memory can manage Q4.
7B–8B Models (e.g., Llama 3.1 8B, Qwen 3 8B, Mistral 7B)
The most popular tier for home users. At Q4_K_M they fit comfortably in 8 GB VRAM with room for context.
| Quantization | VRAM Required |
|---|---|
| Q4_K_M | ~5 GB |
| Q5_K_M | ~5.5–6 GB |
| Q6_K | ~6.5–7 GB |
| Q8_0 | ~8.5–9 GB |
Best for: The everyday workhorse. Chat, writing, coding, reasoning. Quality is genuinely useful for most tasks.
Hardware fit: Q4–Q5 runs on RTX 4060 8GB. Q6–Q8 needs 12 GB (RTX 4070 or RTX 3060 12GB). These models are the sweet spot for budget builds. Check compatibility for your GPU →
13B–14B Models (e.g., Qwen 2.5 14B, Phi-4 14B)
A meaningful step up in quality from 8B. Noticeably better reasoning and coding output, at the cost of needing more VRAM.
| Quantization | VRAM Required |
|---|---|
| Q4_K_M | ~8–9 GB |
| Q5_K_M | ~10–11 GB |
| Q6_K | ~12–13 GB |
| Q8_0 | ~15–16 GB |
Best for: Users who want better-than-8B quality without jumping to 30B+ memory requirements. Great coding and instruction-following.
Hardware fit: Q4 on RTX 4060 Ti 16GB or RTX 4070 12GB. Q6–Q8 needs 16 GB (RTX 4080). These are the top-end models for a 12 GB GPU budget.
30B–34B Models (e.g., Qwen 3 30B, DeepSeek R1 32B Distill)
High-capability models. At Q4 they need around 18–20 GB VRAM — pushing past what most consumer GPUs offer.
| Quantization | VRAM Required |
|---|---|
| Q4_K_M | ~17–21 GB |
| Q5_K_M | ~21–24 GB |
| Q6_K | ~25–27 GB |
Best for: Complex reasoning, professional-level coding, nuanced writing tasks where 14B shows its limits.
Hardware fit: A 24 GB RTX 4090 or RTX 3090 runs these at Q4 with room to spare. Mac M2 Max (32 GB), M3 Max (48 GB), or M4 Max (64 GB) handle them at Q6+ comfortably. Browse 30B models →
MoE Models: Special Case (e.g., Qwen 3 30B-A3B, Mixtral 8x7B)
Mixture-of-Experts (MoE) models look like large models but behave like smaller ones at runtime. Qwen 3 30B-A3B has 30B total parameters but only ~3B active per token.
Critical: You must fit all parameters in VRAM (even inactive ones), but inference speed and quality match a ~3B active model.
| Model | Total Params | Active Params | Q4 VRAM |
|---|---|---|---|
| Qwen 3 30B-A3B | 30B | ~3B | ~19 GB |
| Mixtral 8x7B | 47B | ~13B | ~26 GB |
| Qwen 3 235B-A22B | 235B | ~22B | ~130 GB |
MoE models offer exceptional quality-per-VRAM for inference when they fit. Check Qwen 3 30B-A3B compatibility →
70B Models (e.g., Llama 3.3 70B, Qwen 2.5 72B)
The frontier of single-machine local inference. These are models where results are genuinely close to frontier proprietary models like GPT-4.
| Quantization | VRAM Required |
|---|---|
| Q4_K_M | ~39 GB |
| Q5_K_M | ~48 GB |
| Q8_0 | ~74 GB |
Hardware fit: No consumer GPU fits these without offloading. Options:
- RTX 4090 (24 GB) + CPU offload: runs at ~30–50% speed penalty
- RTX 5090 (32 GB) + CPU offload: less offloading needed
- Mac M4 Max 64 GB: fits Q4 natively, comfortable
- Mac M4 Ultra 128 GB: fits Q8 natively
Check if your hardware can run Llama 3.3 70B →
100B–405B Models (e.g., Llama 3.1 405B, DeepSeek R1 671B)
Consumer hardware cannot run these without extreme multi-GPU setups or CPU offloading to the point of impracticality.
- Llama 3.1 405B at Q4: ~225 GB VRAM required
- DeepSeek R1 671B at Q4: ~372 GB VRAM required
At this scale, you are looking at A100/H100 clusters, or a Mac Studio/Mac Pro with M2 Ultra (192 GB) or M3 Ultra (192 GB). For most users, these models are better accessed via API.
How Quantization Changes the Calculation
Quantization is the single biggest lever for fitting models on limited hardware. The difference between FP16 and Q4_K_M is roughly 4x less VRAM for a moderate quality trade-off.
What you lose with each step down:
- FP16 → Q8: Nearly nothing. Q8 is essentially lossless. Saves ~47% VRAM.
- Q8 → Q6: Slight degradation, imperceptible in most tasks. Saves another 24% VRAM.
- Q6 → Q5: Noticeable on benchmarks, rarely on everyday tasks. Saves ~15%.
- Q5 → Q4: Moderate quality loss. Chat holds up well; coding and reasoning degrade. Saves another 19%.
- Q4 → Q3: Noticeable degradation. Outputs become less coherent. Only use as a last resort.
The practical rule: use the highest quantization your VRAM allows while keeping the entire model in VRAM. A model at Q4 fully in VRAM will always outperform the same model at Q8 with half its layers offloaded to CPU.
Explore how quantization levels affect your models →
VRAM Tiers and What They Unlock
Here is a practical summary of what each VRAM tier makes possible:
6–8 GB VRAM (RTX 4060, RTX 3070, RX 6700 XT)
- 3–4B models at Q6–Q8: great quality, fast
- 7–8B models at Q4: good quality, the everyday workhorse
- 13B at Q3: possible but degraded — not recommended
Best pick: Llama 3.2 3B or Phi-4-mini at Q6 for speed; Llama 3.1 8B or Qwen 3 8B at Q4 for quality.
12 GB VRAM (RTX 4070, RTX 3060 12GB)
- 7–8B models at Q8: near-lossless quality
- 13–14B models at Q4–Q5: comfortable fit
- 30B MoE (like Qwen 3 30B-A3B) at Q2–Q3: tight, marginal quality
Best pick: Qwen 3 8B at Q8, or Phi-4 14B at Q5_K_M for excellent results.
16–20 GB VRAM (RTX 4060 Ti 16GB, RTX 4080, A4000)
- 14B models at Q8: essentially lossless
- 30B models at Q4–Q5: comfortable
- 70B at Q2–Q3: possible with quality trade-off
Best pick: Qwen 2.5 14B at Q8, or DeepSeek R1 32B Distill at Q4 for strong reasoning.
24 GB VRAM (RTX 4090, RTX 3090, RX 7900 XTX)
- 30B models at Q6: high quality
- 70B at Q4 with ~30% CPU offload: acceptable performance
- 70B fully native: not without offload on single GPU
Best pick: Qwen 3 30B at Q6, or Llama 3.3 70B at Q4 with minimal offloading. See what fits a 24GB GPU →
32 GB VRAM (RTX 5090, RTX 6000 Ada)
- 70B models at Q4: just barely fits
- 30B models at Q8: comfortable
- 70B at Q5: needs slight offload
Best pick: Llama 3.3 70B at Q4 fully in VRAM, or Qwen 3 72B at Q4.
Apple Silicon Unified Memory (M4 Max 64 GB, M4 Ultra 128 GB)
Apple Silicon changes the VRAM equation because the CPU and GPU share the same memory pool. There's no VRAM vs RAM split.
- M4 Max 64 GB: 70B at Q4 comfortably; 30B at Q8; 14B at FP16
- M4 Ultra 128 GB: 70B at Q8; 100B+ at Q4; 405B with some offloading
Speed is slower than equivalent discrete GPUs, but capacity is unmatched in consumer hardware.
Compare Apple Silicon configs →
Practical Recommendations
Under $300 GPU budget: RTX 4060 8GB. Runs 7–8B models at Q4 well. Great entry point.
$400–600 GPU budget: RTX 4070 12GB. The sweet spot — 14B at Q5, 8B at Q8. Most versatile consumer choice.
$800–1200 GPU budget: RTX 4090 24GB. Maximum consumer VRAM until RTX 5090. Runs 30B at Q6.
For very large models: Mac M4 Max or Mac Studio M4 Ultra. More memory capacity than any consumer GPU at a premium price.
Already have hardware? Check what you can run →
Key Takeaways
- VRAM is the primary constraint. Speed matters, but only if the model fits first.
- Quantization is not optional for most users — Q4_K_M at your target model size is the starting point, not the compromise.
- A smaller model fully in VRAM beats a larger model half-offloaded. Always.
- MoE models offer exceptional quality per VRAM — check if there's a MoE option before assuming you can't run large models.
- Apple Silicon's unified memory is a genuine advantage for fitting large models, at a speed trade-off.
Use the hardware compatibility calculator to find the best models for your specific GPU — it factors in VRAM, bandwidth, and architecture to tell you what will actually run well, not just what will technically load.
Check your hardware compatibility →
Related: VRAM reference table for all models | Best GPU for local AI | Quantization explained