What can you run on 16GB VRAM?

16GB VRAM comfortably runs Qwen 3.5 9B at Q8_0 (~10 GB) with long context, Llama 3.1 8B at Q8, and Qwen 3 14B at Q4/Q5. Qwen 3.5 27B dense at Q4_K_M (~16 GB) is a tight fit with little room left for context. DeepSeek R1 Distill 32B and 30B-A3B-class MoE models are out of reach at a useful quant.

What can you run on 24GB VRAM?

24GB VRAM is the sweet spot: Qwen 3.5 35B-A3B MoE at Q4_K_M (~21 GB), Qwen 3 30B-A3B at Q4, DeepSeek R1 Distill 32B at Q4, or Qwen 3.5 27B dense at Q5 or Q6. Expect 60-90 tok/s on an RTX 4090 or 5090. Qwen3.6-35B-A3B will also fit at Q4 once open weights ship.

What can you run on 32GB VRAM?

32GB VRAM (RTX 5090) adds meaningful headroom: Qwen 3.5 35B-A3B at Q5 or Q6, Qwen 3.5 27B at Q8, DeepSeek R1 Distill 32B at Q5, and partial offload for Llama 4 Scout 109B. It also handles 1M-context workflows with the Qwen 3.6 series.

What is the largest model I can run on 24GB VRAM?

By total parameters: Qwen 3.5 35B-A3B (35B total, 3B active MoE) at Q4. By dense parameters: a 32B model such as Qwen 3 32B or DeepSeek R1 Distill 32B at Q4 (~19 GB). With partial CPU offload you can run larger models at 5-10 tok/s, but the practical ceiling at interactive speeds is the 35B-A3B MoE class.

Is 24GB VRAM enough for local AI in 2026?

Yes, for most users. 24GB handles the best MoE models (35B-A3B class) at Q4, the top dense models up to 32B at Q4, and long 262K context. Upgrade to 32GB+ only if you need Q6/Q8 quality on 35B-A3B or sustained 1M-context workloads with Qwen 3.6.

Is 16GB VRAM still enough for local LLMs?

Yes, for the 9B-14B class. Qwen 3.5 9B at Q8 and Qwen 3 14B at Q4/Q5 are daily drivers. For coding, prefer Qwen 3 Coder 14B at Q5 or DeepSeek Coder V2.5 Lite at Q4. It falls short of 24GB only for MoE models and very long context.

What VRAM do I need for Qwen 3.6?

Qwen3.6-35B-A3B needs ~21 GB at Q4_K_M (same as Qwen 3.5 35B-A3B). Minimum: 24 GB VRAM or 32 GB unified memory. Recommended: RTX 4090, RTX 5090, or Mac M4 Pro 24GB+ for Q4 comfort.

April 22, 2026

Tags: vram, local-llm, buyer-guide, 16gb, 24gb, 32gb

What Can You Run on 16GB, 24GB, 32GB VRAM? — Local LLM Guide (April 2026)

What local LLMs fit on 16GB, 24GB, or 32GB VRAM in 2026: top model per tier with Q4/Q8 numbers, tokens/sec on RTX 4080/4090/5090, coding picks, and when to upgrade.

Short answer: 24 GB VRAM is the 2026 sweet spot. 16 GB still runs a strong lineup, and 32 GB adds headroom for Q5/Q6 and 1M-context Qwen 3.6 workloads.

This guide gives the exact local LLMs that fit on 16 GB, 24 GB, and 32 GB VRAM in April 2026 — per-tier top picks, tokens/second on common GPUs, coding vs chat picks, and when upgrading is actually worth it.

Need a fit check for a specific GPU? Use the VRAM Calculator. For a ranked list against a specific Apple Silicon Mac, see Best Local LLMs for MacBook Air M4 24GB or MacBook Pro M4 Pro 24GB.
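For a quick offline estimate, the size of quantized weights is roughly the parameter count times bits per weight, divided by eight. The sketch below is only a rule of thumb: the bits-per-weight values are approximate assumptions for common GGUF quants (Q8_0 is 8.5 by construction, while the K-quant figures are typical averages that vary by model), and `overhead_gb` is a hypothetical fixed allowance for KV cache and runtime buffers. The names `APPROX_BPW`, `weights_gb`, and `fits` are illustrative and do not come from any tool mentioned in this guide.

```python
# Approximate bits per weight for common GGUF quants (assumed averages).
APPROX_BPW = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * APPROX_BPW[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb."""
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for quant in APPROX_BPW:
        size = weights_gb(27, quant)
        print(f"27B at {quant}: about {size:.1f} GB of weights")
```

Under these assumptions a 27B dense model at Q8_0 comes out near 28.7 GB, in line with the ~28.6 GB figure quoted for the 32 GB tier. Long contexts need considerably more than the fixed allowance used here, so treat a narrow pass as a warning rather than a green light.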

16 GB VRAM — Mid-range (RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 5070 Ti, RTX 4080 16GB, RX 7900 GRE)

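The sweet spot for dense 9-27B models. Qwen3.6-27B (released April 22, 2026) is the best model at this tier, but its Q4_K_M file (16.8 GB) slightly exceeds 16 GB, so plan on offloading a few layers to the CPU or dropping to a smaller quant.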

Use case	Model	VRAM Q4	Best quant	tok/s (RTX 4070 Ti Super)
Best overall (NEW)	Qwen3.6-27B	16.8 GB	Q4_K_M	~38
Best coding / agentic	Qwen3.6-27B	16.8 GB	Q4_K_M	~38
General chat	Qwen 3.5 9B	~5.1 GB	Q8_0 (~10 GB)	~70
Coding (small / fast)	Qwen 3 Coder 14B	~8.3 GB	Q6_K (~12 GB)	~50
Previous-gen dense	Qwen 3.5 27B	~16 GB	Q4_K_M tight	~38
Instruction-follow	Llama 3.1 8B	~4.6 GB	Q8_0 (~8 GB)	~80
Reasoning / Math	DeepSeek R1 Distill 14B	~8 GB	Q5_K_M	~60

Does NOT fit at a useful quant: Qwen3.6-35B-A3B MoE (~21 GB), Qwen 3.5 35B-A3B (~21 GB), DeepSeek R1 Distill 32B (~19 GB), any Llama 4 variant.

Upgrade trigger: If you want MoE efficiency or long-context agentic workloads, jump to 24 GB.

24 GB VRAM — Enthusiast (RTX 4090, RTX 3090, RX 7900 XTX, Mac M4 Pro 24GB)

The 2026 sweet spot. It handles the best MoE models, the top dense 27-32B models, and long 262K context.

Use case	Model	VRAM Q4	Best quant	tok/s (RTX 4090)
Best coding / flagship (NEW)	Qwen3.6-27B dense	16.8 GB	Q6_K (22.5 GB)	~60
Best MoE throughput	Qwen3.6-35B-A3B	~21 GB	Q4_K_M	~70
Best coding (prev-gen)	Qwen 3 Coder 30B-A3B	~17 GB	Q4_K_M	~75
Dense reasoning	Qwen 3 32B	~19 GB	Q4_K_M	~55
Prev-gen MoE	Qwen 3.5 35B-A3B	~21 GB	Q4_K_M	~70
Math / Code	DeepSeek R1 Distill 32B	~19 GB	Q4_K_M	~50

Does NOT fit at useful quant: Llama 4 Maverick (requires 128GB), DeepSeek V3 full, Qwen 3.5 122B-A10B (needs 80GB).

Upgrade trigger: If you need Q6/Q8 on 35B-A3B (for coding precision) or 1M-context workflows, jump to 32 GB or Mac 36-64 GB.

32 GB VRAM — High-end consumer (RTX 5090 32GB)

This tier runs Q5/Q6 on 35B-A3B, partial offload for Llama 4 Scout, and Qwen 3.6 at 1M context.

Use case	Model	VRAM	Best quant	tok/s (RTX 5090)
Best coding (NEW)	Qwen3.6-27B dense	~28.6 GB	Q8_0	~85
Best MoE	Qwen3.6-35B-A3B	~25 GB	Q5_K_M	~90
Best prev-gen coding	Qwen 3 Coder 30B-A3B	~20 GB	Q6_K	~85
Long context	Qwen 3.5 27B (128K)	~18 GB	Q8_0	~55
Dense reasoning	Qwen 3 32B	~23 GB	Q5_K_M	~60
Partial offload	Llama 4 Scout 109B	~28 GB (partial)	Q4 offload	~15
1M-context (Qwen 3.6)	Qwen3.6-35B-A3B	~21-40 GB	Q4-Q5 YaRN	~90

Upgrade trigger: For 35B-A3B at Q8 (effectively FP16 quality) or multi-model concurrent use, go to 48 GB+ (RTX A6000, Mac M4 Max 64GB).
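The "best quant" choices across the tiers can be approximated by picking the highest-precision quant whose estimated weights, plus some headroom, still fit in VRAM. The following is a rough sketch, not the method used to build the tables: the bits-per-weight values are approximate assumptions, `headroom_gb` is a hypothetical allowance for KV cache and buffers, and `best_quant` is an illustrative name.

```python
from typing import Optional

# Approximate bits per weight (assumed), ordered from highest to lowest precision.
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.56), ("Q5_K_M", 5.7), ("Q4_K_M", 4.85)]


def best_quant(params_billion: float, vram_gb: float,
               headroom_gb: float = 1.5) -> Optional[str]:
    """Return the highest-precision quant that fits, or None if none does."""
    for name, bits_per_weight in QUANTS:
        weights = params_billion * bits_per_weight / 8  # GB, 1 GB = 1e9 bytes
        if weights + headroom_gb <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (16, 24, 32, 48):
        print(f"35B total parameters on {vram} GB: {best_quant(35, vram)}")
```

With these assumptions a 35B model lands on Q4_K_M at 24 GB, Q6_K at 32 GB, and Q8_0 at 48 GB, which lines up with the upgrade triggers in each tier. Note that an MoE model still has to hold all of its parameters in memory, so the total count (35B) is the one to pass in, not the active count.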

Apple Silicon unified memory equivalents

Unified memory behaves differently: macOS reserves roughly 15-25% for the system, so the effective "LLM headroom" is lower than the nominal capacity:

Mac config	Effective LLM RAM	Closest GPU tier
MacBook Air M4 16GB	~12 GB	RTX 4060 Ti 16GB
MacBook Air M4 24GB	~19 GB	between 16 and 24 GB tiers
MacBook Pro M4 24GB	~19 GB	between 16 and 24 GB tiers
MacBook Pro M4 Pro 24GB	~20 GB	24 GB class (higher bandwidth)
MacBook Pro M4 Max 36GB	~30 GB	32 GB class
MacBook Pro M4 Max 48GB	~40 GB	32-48 GB class
MacBook Pro M4 Max 64GB	~54 GB	48 GB class
Mac Studio M3 Ultra 96GB	~80 GB	workstation

See MacBook Air M4 vs Pro M4 for Local LLMs for the full decision guide.
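The headroom figures in the table above follow from subtracting the system reserve from total unified memory. As a minimal sketch, assuming the 15-25% reserve range stated earlier (the function name `effective_llm_gb` and the default of 20% are illustrative choices, not measured values):

```python
def effective_llm_gb(unified_gb: float, reserve_fraction: float = 0.20) -> float:
    """Memory left for the model after the OS reserve, in GB."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    return unified_gb * (1 - reserve_fraction)


if __name__ == "__main__":
    for total in (16, 24, 36, 48, 64, 96):
        low = effective_llm_gb(total, 0.25)   # pessimistic: 25% reserved
        high = effective_llm_gb(total, 0.15)  # optimistic: 15% reserved
        print(f"{total} GB unified: roughly {low:.1f} to {high:.1f} GB for the model")
```

Each value in the table falls inside the range this produces for its configuration. Actual headroom depends on what else is running, so the lower bound is the safer planning number.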

Expected tokens per second (Q4_K_M)

Model	RTX 4060 Ti 16GB	RTX 4090 24GB	RTX 5090 32GB	Mac M4 Pro 24GB	Mac M4 Max 64GB
Qwen 3 8B	50	85	110	40	45
Qwen 3.5 9B	50	85	110	40	45
Qwen 3 14B	35	55	70	22	28
Qwen 3.5 27B	—	35	45	18	24
Qwen 3 30B-A3B MoE	—	70	90	34	42
Qwen 3.5 35B-A3B	—	65	85	30	40
DeepSeek R1 Distill 32B	—	45	60	20	26

Decision framework

You type slowly or mostly chat: 16 GB is fine, stick with Qwen 3.5 9B Q8.
You code all day: 24 GB (RTX 4090/3090) + Qwen 3 Coder 30B-A3B. Best ROI.
You want MoE + long context: 32 GB RTX 5090 or Mac M4 Max 36-48 GB.
You run a team workstation: 48-64 GB (Mac Studio or MacBook Pro with M4 Max) for Q8 on 35B-A3B.
You run an API for multiple users: skip consumer GPUs; go with an H100 80GB or datacenter multi-GPU (see Multi-GPU LLM Inference Guide).

Related guides

Qwen 3.6 VRAM & Release Date — latest flagship MoE
Qwen3.6-35B-A3B Hardware Requirements (Buyer Guide)
Best Local Coding LLMs for Apple Silicon 24GB
Best GPU for Running LLMs Locally (2026)
Best Local LLMs by VRAM Tier — 11 tiers ranked
VRAM Calculator — check any combo