Llama 4 VRAM Requirements — Scout 109B & Maverick 400B (GPU & Mac Guide)
Llama 4 Scout needs ~61GB at Q4, Maverick needs ~224GB. Complete VRAM tables for every quantization level with GPU and Mac hardware recommendations.
Meta's Llama 4 represents a major architectural shift from the dense Llama 3 family to a Mixture of Experts (MoE) design. While this delivers better quality per compute, it also means larger total model sizes that challenge local hardware.
Llama 4 Model Family
| Model | Total Params | Active Params | Architecture | Experts |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | MoE | 16 experts, 1 active |
| Llama 4 Maverick | 400B | 17B | MoE | 128 experts, 1 active |
| Llama 4 Behemoth | ~2T | 288B | MoE | 16 experts; unreleased |
Scout and Maverick share the same 17B active parameter count, so their per-token compute and token generation speed are similar. The difference is capacity: Maverick spreads far more total parameters across its 128 experts, which is where its quality advantage comes from.
VRAM Requirements
| Model | Q3_K_M | Q4_K_M | Q5_K_M | Q6_K | Q8_0 |
|---|---|---|---|---|---|
| Llama 4 Scout (109B) | 48 GB | 61 GB | 75 GB | 88 GB | 116 GB |
| Llama 4 Maverick (400B) | 176 GB | 224 GB | 276 GB | 324 GB | 424 GB |
Important: MoE models must load ALL parameters into memory, even though only 17B are active per token. This is why the VRAM requirements are much higher than a 17B dense model.
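The figures in the table follow a simple rule of thumb: weight memory is roughly total parameters times bits per weight, divided by 8. The sketch below illustrates that arithmetic. The bits-per-weight values are assumptions back-derived from the table above, not official GGUF figures, and `estimate_weight_gb` is an illustrative name. KV cache and runtime overhead are not included.

```python
# Approximate bits per weight for each quantization level.
# ASSUMPTION: back-derived from the VRAM table above, not official figures.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.52,
    "Q4_K_M": 4.48,
    "Q5_K_M": 5.50,
    "Q6_K": 6.47,
    "Q8_0": 8.50,
}


def estimate_weight_gb(total_params_billions: float, quant: str) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes) for a quantized model.

    Uses TOTAL parameters, because every expert of an MoE model must be
    resident in memory even though only some are active per token.
    """
    bits = BITS_PER_WEIGHT[quant]
    return total_params_billions * bits / 8


# Print one row per model, mirroring the table layout.
for name, params in [("Scout", 109), ("Maverick", 400)]:
    row = ", ".join(
        f"{q}: {estimate_weight_gb(params, q):.0f} GB" for q in BITS_PER_WEIGHT
    )
    print(f"{name}: {row}")
```

Passing 17 instead of 109 shows what a dense 17B model would need, which makes the gap described above concrete.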
Hardware Recommendations
Llama 4 Scout (109B MoE)
Scout is the accessible Llama 4 variant, though "accessible" still means high-end hardware.
Consumer options:
- Mac M4 Max 128GB: fits at Q4 (~61GB needed, ~92GB usable)
- Mac Studio M2 Ultra 192GB: comfortable at Q5+
Datacenter and multi-GPU options:
- A100 80GB: fits at Q4 on a single card, with limited room left for context
- H100 80GB: same fit, with higher memory bandwidth
- 2× RTX 4090 24GB with tensor parallelism: 48GB total, so Q3_K_M (48GB) leaves no room for context and Q2 is needed
Quick start:
ollama run llama4:scout
Llama 4 Maverick (400B MoE)
Maverick requires serious hardware — even at Q4 it needs 224GB.
Recommended hardware:
- MI300X 192GB × 2 (384GB total)
- A100 80GB × 4 (320GB total)
- H100 80GB × 4 (320GB total)
- Mac Studio M2 Ultra 192GB at Q2-Q3 with offloading (very slow)
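The multi-GPU counts above can be sanity-checked with a short calculation. This is a minimal sketch: the 10% overhead multiplier for KV cache and runtime buffers is an assumption, not a measured value, and `gpus_needed` is an illustrative helper.

```python
import math


def gpus_needed(model_gb: float, gpu_gb: float, overhead: float = 1.10) -> int:
    """Smallest number of identical GPUs whose combined memory holds the model.

    overhead is an ASSUMED multiplier (10%) for KV cache and runtime buffers.
    """
    return math.ceil(model_gb * overhead / gpu_gb)


MAVERICK_Q4_GB = 224  # Q4_K_M size from the VRAM table

print(gpus_needed(MAVERICK_Q4_GB, 80))   # 80 GB cards (A100 / H100)
print(gpus_needed(MAVERICK_Q4_GB, 192))  # 192 GB cards (MI300X)
```

Under that assumption the sketch arrives at four 80GB cards and two 192GB cards, in line with the list above. Three 80GB cards hold the weights on paper (240GB) but leave almost nothing for context.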
When to Use Llama 3 Instead
For most consumer hardware, the Llama 3 family remains more practical:
| Your VRAM | Best Llama Option |
|---|---|
| 8 GB | Llama 3.2 3B |
| 12 GB | Llama 3.1 8B at Q6+ |
| 24 GB | Llama 3.1 8B at Q8 (fast) |
| 32 GB | Llama 3.3 70B at Q3 with offload |
| 64 GB+ | Llama 3.3 70B at Q5+ or Llama 4 Scout at Q3 |
| 128 GB+ | Llama 4 Scout at Q4+ |
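For scripting hardware checks, the table can be expressed as a small lookup. This is only a sketch: the tiers are copied from the table above and `recommend` is an illustrative name.

```python
from bisect import bisect_right

# (minimum memory in GB, suggestion) pairs copied from the table above.
TIERS = [
    (8, "Llama 3.2 3B"),
    (12, "Llama 3.1 8B at Q6+"),
    (24, "Llama 3.1 8B at Q8 (fast)"),
    (32, "Llama 3.3 70B at Q3 with offload"),
    (64, "Llama 3.3 70B at Q5+ or Llama 4 Scout at Q3"),
    (128, "Llama 4 Scout at Q4+"),
]


def recommend(memory_gb: float):
    """Return the suggestion for the largest tier memory_gb reaches, or None."""
    thresholds = [minimum for minimum, _ in TIERS]
    idx = bisect_right(thresholds, memory_gb)
    return TIERS[idx - 1][1] if idx else None


print(recommend(16))   # falls in the 12 GB tier
print(recommend(96))   # falls in the 64 GB+ tier
```

Memory sizes between two rows fall back to the lower row, so a 16GB card gets the 12GB suggestion.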
Understanding MoE Architecture
Llama 4's Mixture of Experts design works differently from Llama 3's dense models:
- Dense models (Llama 3): Every parameter is used for every token. A 70B model uses 70B parameters per forward pass.
- MoE models (Llama 4): Only a subset of "expert" sub-networks activates per token. Scout has 16 experts but routes each token to 1, so only about 17B parameters (the chosen expert plus the layers every token shares) compute each token.
The trade-off: MoE models deliver higher quality per compute cycle, but ALL expert weights must reside in memory. You get 70B-class quality from 17B active parameters, but need 109B-worth of VRAM.
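The trade-off is easier to see in a toy example. The sketch below is not Llama 4's actual implementation: it uses tiny hand-written weight vectors, a plain dot-product router, and top-1 routing, and every name in it is illustrative. It shows that the router scores all experts, only the winner does any work, and all expert weights still have to be held in memory.

```python
def top1_expert(scores):
    """Index of the highest router score (top-1 routing)."""
    return max(range(len(scores)), key=scores.__getitem__)


def moe_layer(x, router, experts):
    """Toy MoE layer: score every expert, then run only the winner.

    x       : input vector (list of floats)
    router  : one weight vector per expert, used to score the input
    experts : one weight vector per expert (a stand-in for a full FFN)
    """
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = top1_expert(scores)
    # Only the chosen expert's weights are read for this token.
    y = [w * xi for w, xi in zip(experts[chosen], x)]
    return y, chosen


router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
experts = [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

y, chosen = moe_layer([0.2, 0.9], router, experts)
resident = sum(len(e) for e in experts)  # all experts stay in memory
active = len(experts[chosen])            # weights actually used for this token
print(f"chosen expert: {chosen}, resident weights: {resident}, active: {active}")
```

A different input can select a different expert, which is why none of the experts can be left out of memory.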
For more about how this affects quantization, read our quantization guide.
Quantization Recommendations
Since Llama 4 models are already large, every bit of compression helps:
- Plenty of memory (>1.5× model size): Q5_K_M — great quality
- Tight fit: Q4_K_M — the standard choice
- Need to squeeze: Q3_K_M — noticeable quality drop but functional
- Emergency: Q2_K — significant degradation, last resort
MoE models are sometimes reported to be somewhat more tolerant of quantization than dense models, but treat this as a rule of thumb rather than a guarantee and compare quantization levels on your own prompts before committing.
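The list above amounts to "pick the highest-quality level that fits". A minimal sketch of that selection for Scout follows, using the sizes from the VRAM table. The 10% headroom for KV cache and runtime buffers is an assumption, and `pick_quant` is an illustrative name.

```python
# Scout weight sizes in GB, copied from the VRAM table, largest first.
SCOUT_SIZES_GB = [
    ("Q8_0", 116),
    ("Q6_K", 88),
    ("Q5_K_M", 75),
    ("Q4_K_M", 61),
    ("Q3_K_M", 48),
]


def pick_quant(usable_gb, sizes=SCOUT_SIZES_GB, headroom=1.10):
    """Return the highest-quality quantization that fits, or None.

    headroom is an ASSUMED 10% allowance for KV cache and runtime buffers.
    """
    for name, size_gb in sizes:
        if size_gb * headroom <= usable_gb:
            return name
    return None  # nothing fits: consider Q2_K or a smaller model


print(pick_quant(80))  # a single 80 GB card
print(pick_quant(48))  # 2x 24 GB cards
```

With 80GB the sketch selects Q4_K_M, and with 48GB it returns `None`, which corresponds to the Q2_K last-resort case.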
Performance Expectations
Despite the large total parameter count, Llama 4 Scout generates tokens at speeds comparable to a dense 17B model (since only 17B parameters are active):
| Hardware | Scout (Q4) | Notes |
|---|---|---|
| Mac M4 Max 128GB | ~15-20 tok/s | Unified memory bandwidth limited |
| A100 80GB | ~35-45 tok/s | High bandwidth, fits at Q4 |
| H100 80GB | ~50-65 tok/s | Best single-GPU option |
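Why active parameters drive speed can be sketched with a bandwidth ceiling: each generated token has to read every active weight once, so decode speed cannot exceed memory bandwidth divided by the bytes of active weights. The function below is an illustrative upper bound only. The 4.5 bits per weight (roughly Q4_K_M) is an assumption, and real throughput, such as the figures in the table, is typically well below this ceiling because of attention, routing, and kernel overheads.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billions: float = 17.0,
                          bits_per_weight: float = 4.5) -> float:
    """Upper bound on decode speed when generation is memory-bandwidth bound.

    bandwidth_gb_s         : memory bandwidth in GB/s (look up your hardware's spec)
    active_params_billions : parameters read per token (17B for Scout and Maverick)
    bits_per_weight        : ASSUMED average for the chosen quantization
    """
    gb_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Hypothetical 500 GB/s device, purely for illustration.
print(f"{max_tokens_per_second(500):.1f} tok/s ceiling")
# The same device running a dense 109B model would have a far lower ceiling.
print(f"{max_tokens_per_second(500, active_params_billions=109):.1f} tok/s ceiling")
```

Comparing the two calls shows the benefit of MoE: the ceiling scales with the 17B active parameters, not the 109B total.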
Getting Started
- Check your hardware — see if Scout fits
- If Scout doesn't fit, Llama 3.1 8B or Llama 3.3 70B are excellent alternatives
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Run:
ollama run llama4:scout
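Once the model is running, it can also be called from code through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes the server is on its default port (11434) and that the model tag is `llama4:scout`; check `ollama list` for the tag on your machine. The helper names are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama4:scout") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server.

    The default model tag is an ASSUMPTION; adjust it to what `ollama list` shows.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama4:scout") -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain Mixture of Experts in one sentence.")` returns the model's reply as a string. The first call can take a while, since the weights have to be loaded into memory.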