What Can You Run on 16GB, 24GB, 32GB VRAM? — Local LLM Guide (April 2026)
What local LLMs fit on 16GB, 24GB, or 32GB VRAM in 2026: top model per tier with Q4/Q8 numbers, tokens/sec on RTX 4080/4090/5090, coding picks, and when to upgrade.
Short answer: 24 GB VRAM is the 2026 sweet spot. 16 GB still runs a strong lineup, and 32 GB adds headroom for Q5/Q6 quants and 1M-context Qwen 3.6 workloads.
This guide lists the local LLMs that fit on 16 GB, 24 GB, and 32 GB VRAM in April 2026: per-tier top picks, tokens/second on common GPUs, coding vs chat picks, and when upgrading is actually worth it.
Need a fit check for a specific GPU? Use the VRAM Calculator. For a ranked list against a specific Apple Silicon Mac, see Best Local LLMs for MacBook Air M4 24GB or MacBook Pro M4 Pro 24GB.
16 GB VRAM — Mid-range (RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 5070 Ti, RTX 4080 16GB, RX 7900 GRE)
The sweet spot for dense 9-27B models. Qwen3.6-27B (released April 22, 2026) is the best model at this tier, but at Q4_K_M (16.8 GB) it is a very tight fit on a 16 GB card, so expect to keep context short or offload a few layers.
| Use case | Model | VRAM Q4 | Best quant | tok/s (RTX 4070 Ti) |
|---|---|---|---|---|
| Best overall (NEW) | Qwen3.6-27B | 16.8 GB | Q4_K_M | ~38 |
| Best coding / agentic | Qwen3.6-27B | 16.8 GB | Q4_K_M | ~38 |
| General chat | Qwen 3.5 9B | ~5.1 GB | Q8_0 (~10 GB) | ~70 |
| Coding (small / fast) | Qwen 3 Coder 14B | ~8.3 GB | Q6_K (~12 GB) | ~50 |
| Previous-gen dense | Qwen 3.5 27B | ~16 GB | Q4_K_M tight | ~38 |
| Instruction-follow | Llama 3.1 8B | ~4.6 GB | Q8_0 (~8 GB) | ~80 |
| Reasoning / Math | DeepSeek R1 Distill 14B | ~8 GB | Q5_K_M | ~60 |
Does NOT fit at useful quant: Qwen3.6-35B-A3B MoE (~21 GB), Qwen 3.5 35B-A3B (~21 GB), DeepSeek R1 32B full, any Llama 4 variant.
Upgrade trigger: If you want MoE efficiency or long-context agentic workloads, jump to 24 GB.
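The fit rule behind the table above can be sketched as a simple comparison: quantized weights plus an allowance for KV cache and runtime buffers must stay under the card's VRAM. The model sizes below come from this guide; the 1.5 GB overhead and the names `fits_in_vram` and `models_that_fit` are illustrative assumptions, not measured values or part of any tool.

```python
# Q4 weight sizes (GB) as quoted in this guide.
Q4_SIZES_GB = {
    "Qwen 3.5 9B": 5.1,
    "Qwen 3 Coder 14B": 8.3,
    "Qwen3.6-27B": 16.8,
    "Qwen3.6-35B-A3B": 21.0,
}


def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Return True if weights plus an ASSUMED overhead fit in VRAM.

    overhead_gb stands in for KV cache and runtime buffers; the real value
    depends on context length and runtime, so treat 1.5 as a placeholder.
    """
    return model_gb + overhead_gb <= vram_gb


def models_that_fit(vram_gb: float, overhead_gb: float = 1.5) -> list[str]:
    """List the models from Q4_SIZES_GB that pass the fit check."""
    return [
        name
        for name, size_gb in Q4_SIZES_GB.items()
        if fits_in_vram(size_gb, vram_gb, overhead_gb)
    ]


if __name__ == "__main__":
    for tier in (16, 24, 32):
        print(tier, "GB:", models_that_fit(tier))
```

With the assumed 1.5 GB allowance, the 35B-A3B MoE clears 24 GB but not 16 GB, which matches the upgrade trigger above. Lower `overhead_gb` to model shorter contexts or partial offload.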
24 GB VRAM — Enthusiast (RTX 4090, RTX 3090, RX 7900 XTX, Mac M4 Pro 24GB)
The 2026 sweet spot. Handles the best MoE models, the top dense 27-32B models, and long (262K) context.
| Use case | Model | VRAM Q4 | Best quant | tok/s (RTX 4090) |
|---|---|---|---|---|
| Best coding / flagship (NEW) | Qwen3.6-27B dense | 16.8 GB | Q6_K (22.5 GB) | ~60 |
| Best MoE throughput | Qwen3.6-35B-A3B | ~21 GB | Q4_K_M | ~70 |
| Best coding (prev-gen) | Qwen 3 Coder 30B-A3B | ~17 GB | Q4_K_M | ~75 |
| Dense reasoning | Qwen 3 32B | ~19 GB | Q4_K_M | ~55 |
| Prev-gen MoE | Qwen 3.5 35B-A3B | ~21 GB | Q4_K_M | ~70 |
| Math / Code | DeepSeek R1 Distill 32B | ~19 GB | Q4_K_M | ~50 |
Does NOT fit at useful quant: Llama 4 Maverick (requires 128GB), DeepSeek V3 full, Qwen 3.5 122B-A10B (needs 80GB).
Upgrade trigger: If you need Q6/Q8 on 35B-A3B (for coding precision) or 1M-context workflows, jump to 32 GB or Mac 36-64 GB.
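To see why moving from Q4 to Q6 or Q8 pushes a model into the next tier, a rough weight-size estimate is parameters times bits per weight, divided by 8. The sketch below assumes approximate average bits-per-weight values for common llama.cpp quant types; real GGUF files vary by architecture because some tensors are stored at higher precision, so treat the results as ballpark figures. The function name `estimate_weights_gb` is hypothetical.

```python
# ASSUMED approximate average bits per weight for llama.cpp quant types.
# Q8_0 and Q6_K follow from their block layouts; the K_M values are rough averages.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate quantized weight size in GB (10^9 bytes), weights only."""
    if quant not in BITS_PER_WEIGHT:
        raise ValueError(f"unknown quant type: {quant}")
    # billions of params * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * BITS_PER_WEIGHT[quant] / 8


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"27B at {quant}: ~{estimate_weights_gb(27, quant):.1f} GB")
```

For a 27B dense model this lands close to the sizes quoted in this guide for Q4_K_M, Q6_K, and Q8_0. KV cache and runtime buffers come on top of the estimate.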
32 GB VRAM — High-end consumer (RTX 5090 32GB)
Runs Q5/Q6 on 35B-A3B, Llama 4 Scout with partial offload, and Qwen 3.6 at 1M context.
| Use case | Model | VRAM (at listed quant) | Best quant | tok/s (RTX 5090) |
|---|---|---|---|---|
| Best coding (NEW) | Qwen3.6-27B dense | ~28.6 GB | Q8_0 | ~85 |
| Best MoE | Qwen3.6-35B-A3B | ~25 GB | Q5_K_M | ~90 |
| Best prev-gen coding | Qwen 3 Coder 30B-A3B | ~20 GB | Q6_K | ~85 |
| Long context | Qwen 3.5 27B (128K) | ~18 GB | Q8_0 | ~55 |
| Dense reasoning | Qwen 3 32B | ~23 GB | Q5_K_M | ~60 |
| Partial offload | Llama 4 Scout 109B | ~28 GB (partial) | Q4 offload | ~15 |
| 1M-context (Qwen 3.6) | Qwen3.6-35B-A3B | ~21-40 GB | Q4-Q5 YaRN | ~90 |
Upgrade trigger: For 35B-A3B at Q8 (effectively FP16 quality) or multi-model concurrent use, go to 48 GB+ (RTX A6000, Mac M4 Max 64GB).
Apple Silicon unified memory equivalents
Unified memory behaves differently: macOS reserves roughly 15-25% for the system. Effective LLM headroom by configuration:
| Mac config | Effective LLM RAM | Closest GPU tier |
|---|---|---|
| MacBook Air M4 16GB | ~12 GB | RTX 4060 Ti 16GB |
| MacBook Air M4 24GB | ~19 GB | between 16 and 24 GB tiers |
| MacBook Pro M4 24GB | ~19 GB | between 16 and 24 GB |
| MacBook Pro M4 Pro 24GB | ~20 GB | 24 GB class (higher bandwidth) |
| MacBook Pro M4 Max 36GB | ~30 GB | 32 GB class |
| MacBook Pro M4 Max 48GB | ~40 GB | 32-48 GB class |
| MacBook Pro M4 Max 64GB | ~54 GB | 48 GB class |
| Mac Studio M3 Ultra 96GB | ~80 GB | workstation |
See MacBook Air M4 vs Pro M4 for Local LLMs for the full decision guide.
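The headroom column above follows from the reserve percentage. A minimal sketch of that arithmetic, assuming a flat reserved fraction (the real figure shifts with what else is running); `effective_llm_ram_gb` is a hypothetical helper name:

```python
def effective_llm_ram_gb(total_gb: float, reserved_fraction: float = 0.2) -> float:
    """Unified memory left for the model after an ASSUMED system reserve."""
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    return total_gb * (1 - reserved_fraction)


if __name__ == "__main__":
    # Show the 15-25% band from the text for a few configurations.
    for total in (16, 24, 36, 64):
        low = effective_llm_ram_gb(total, 0.25)
        high = effective_llm_ram_gb(total, 0.15)
        print(f"{total} GB Mac: roughly {low:.1f}-{high:.1f} GB for the model")
```

For a 24 GB Mac the band works out to 18.0-20.4 GB, which brackets the ~19-20 GB entries in the table.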
Expected tokens per second (Q4_K_M)
| Model | RTX 4060 Ti 16GB | RTX 4090 24GB | RTX 5090 32GB | Mac M4 Pro 24GB | Mac M4 Max 64GB |
|---|---|---|---|---|---|
| Qwen 3 8B | 50 | 85 | 110 | 40 | 45 |
| Qwen 3.5 9B | 50 | 85 | 110 | 40 | 45 |
| Qwen 3 14B | 35 | 55 | 70 | 22 | 28 |
| Qwen 3.5 27B | — | 35 | 45 | 18 | 24 |
| Qwen 3 30B-A3B MoE | — | 70 | 90 | 34 | 42 |
| Qwen 3.5 35B-A3B | — | 65 | 85 | 30 | 40 |
| DeepSeek R1 Distill 32B | — | 45 | 60 | 20 | 26 |
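To turn the throughput numbers above into wait time, divide the response length by tokens per second. This sketch covers generation only and ignores prompt processing, which is an assumption that understates latency for long prompts; `generation_seconds` is a hypothetical helper name.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate num_tokens at a steady decode rate (prompt time ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Rates taken from the Qwen 3.5 27B and Qwen 3 30B-A3B MoE rows (RTX 4090).
    for label, rate in (("Qwen 3.5 27B", 35), ("Qwen 3 30B-A3B MoE", 70)):
        print(f"{label}: 700 tokens in ~{generation_seconds(700, rate):.0f} s")
```

At 35 tok/s a 700-token answer takes about 20 seconds; at 70 tok/s it takes about 10, which is the practical difference between the dense and MoE rows on the same card.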
Decision framework
- You type slowly or mostly chat: 16 GB is fine, stick with Qwen 3.5 9B Q8.
- You code all day: 24 GB (RTX 4090/3090) + Qwen 3 Coder 30B-A3B. Best ROI.
- You want MoE + long context: 32 GB RTX 5090 or Mac M4 Max 36-48 GB.
- You run a team workstation: 48-64 GB (Mac Studio / MacBook Pro M4 Max) for Q8 on 35B-A3B.
- You serve an API to multiple users: skip consumer GPUs; go H100 80GB or datacenter multi-GPU (see Multi-GPU LLM Inference Guide).
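The framework above can also be written as a small lookup, which is handy if you want to embed the recommendation in a script. The use-case keys and the `recommend` function are hypothetical names chosen for this sketch; the recommendations themselves are copied from the list.

```python
# Use-case keys are illustrative; the values restate the bullets above.
RECOMMENDATIONS = {
    "chat": "16 GB, Qwen 3.5 9B Q8",
    "coding": "24 GB (RTX 4090/3090), Qwen 3 Coder 30B-A3B",
    "moe_long_context": "32 GB RTX 5090 or Mac M4 Max 36-48 GB",
    "team_workstation": "48-64 GB, Q8 on 35B-A3B",
    "multi_user_api": "H100 80GB or datacenter multi-GPU",
}


def recommend(use_case: str) -> str:
    """Return the guide's hardware suggestion for a named use case."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {valid}") from None


if __name__ == "__main__":
    print(recommend("coding"))
```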
Related guides
- Qwen 3.6 VRAM & Release Date — latest flagship MoE
- Qwen3.6-35B-A3B Hardware Requirements (Buyer Guide)
- Best Local Coding LLMs for Apple Silicon 24GB
- Best GPU for Running LLMs Locally (2026)
- Best Local LLMs by VRAM Tier — 11 tiers ranked
- VRAM Calculator — check any combo