How much VRAM does Llama 4 need?

Llama 4 Scout (17B active, 109B total MoE) needs ~61GB at Q4_K_M. Llama 4 Maverick (17B active, 400B total) needs ~224GB at Q4_K_M. Scout fits on a Mac M4 Max 128GB or a single 80GB datacenter GPU. Maverick requires a multi-GPU setup.

Can I run Llama 4 on an RTX 4090?

Llama 4 Scout doesn't fit natively on a single RTX 4090 (24GB) — it needs ~61GB at Q4. However, with aggressive quantization (Q2) and significant CPU offloading, it's technically possible but very slow. Consider Llama 3.3 70B or Llama 3.1 8B instead.

What's the difference between Llama 4 Scout and Maverick?

Both are Mixture of Experts models with 17B active parameters. Scout has 109B total parameters (16 experts, 1 active), while Maverick has 400B total (128 experts, 1 active). Maverick is more capable but requires significantly more memory.

Is Llama 4 better than Llama 3?

Llama 4 uses a MoE architecture unlike Llama 3's dense models. This means higher quality per active parameter but larger total model size. For most consumer hardware, Llama 3.1 8B or Llama 3.3 70B remain more practical choices.

Can I run Llama 4 on a Mac?

A Mac M4 Max with 128GB can run Llama 4 Scout at Q3-Q4 using unified memory. A Mac M4 Ultra with 192GB handles Scout comfortably at Q4+. Maverick requires 192GB+ and heavy quantization.

March 17, 2026 | llama, meta, vram, gpu-requirements

Llama 4 VRAM Requirements — Scout 109B & Maverick 400B (GPU & Mac Guide)

Llama 4 Scout needs ~61GB at Q4, Maverick needs ~224GB. Complete VRAM tables for every quantization level with GPU and Mac hardware recommendations.

Meta's Llama 4 represents a major architectural shift from the dense Llama 3 family to a Mixture of Experts (MoE) design. While this delivers better quality per compute, it also means larger total model sizes that challenge local hardware.

Llama 4 Model Family

Model	Total Params	Active Params	Architecture	Experts
Llama 4 Scout	109B	17B	MoE	16 experts, 1 active
Llama 4 Maverick	400B	17B	MoE	128 experts, 1 active
Llama 4 Behemoth	~2T	288B	MoE	16 experts, unreleased

Scout and Maverick both use 17B active parameters per token, so per-token compute is similar. The difference is quality: more experts means more specialized knowledge to draw on.

VRAM Requirements

Model	Q3_K_M	Q4_K_M	Q5_K_M	Q6_K	Q8_0
Llama 4 Scout (109B)	48 GB	61 GB	75 GB	88 GB	116 GB
Llama 4 Maverick (400B)	176 GB	224 GB	276 GB	324 GB	424 GB

Important: MoE models must load ALL parameters into memory, even though only 17B are active per token. This is why the VRAM requirements are much higher than a 17B dense model.
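
The table above follows a roughly constant cost per parameter at each quantization level. As a minimal sketch, the factors below are back-derived from this article's own table (they are assumptions fitted to these numbers, not official figures), and the estimate covers weights only, not KV cache or runtime overhead:

```python
# GB of weights per billion parameters, back-derived from the VRAM table above.
# These factors are assumptions fitted to this article's numbers.
GB_PER_BILLION = {
    "Q3_K_M": 0.44,
    "Q4_K_M": 0.56,
    "Q5_K_M": 0.69,
    "Q6_K": 0.81,
    "Q8_0": 1.06,
}


def estimate_weight_gb(total_params_b: float, quant: str) -> int:
    """Estimate weight memory in GB from TOTAL (not active) parameters."""
    return round(total_params_b * GB_PER_BILLION[quant])


if __name__ == "__main__":
    for name, params_b in [("Scout", 109), ("Maverick", 400)]:
        row = {q: estimate_weight_gb(params_b, q) for q in GB_PER_BILLION}
        print(name, row)
```

Note that the function takes the total parameter count (109B or 400B), not the 17B active count, because every expert has to be resident in memory.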

Hardware Recommendations

Llama 4 Scout (109B MoE)

Scout is the accessible Llama 4 variant, though "accessible" still means high-end hardware.

Consumer options:

Mac M4 Max 128GB: fits at Q4 (~61GB needed, ~92GB usable)
Mac M4 Ultra 192GB: comfortable at Q5+

Datacenter and multi-GPU options:

A100 80GB: fits at Q4 on a single card (~61GB of weights, leaving ~19GB for KV cache and overhead)
H100 80GB: same fit, with higher memory bandwidth
2× RTX 4090 24GB with tensor parallelism: 48GB total, so Q3_K_M (~48GB of weights) leaves no headroom and Q2 is needed

Quick start:

ollama run llama4:scout

Llama 4 Maverick (400B MoE)

Maverick requires serious hardware — even at Q4 it needs 224GB.

Recommended hardware:

MI300X 192GB × 2 (384GB total)
A100 80GB × 4 (320GB total)
H100 80GB × 4 (320GB total)
Mac M4 Ultra 192GB at Q2-Q3 with offloading (very slow)
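
The GPU counts above can be sanity-checked with a short calculation. This is a rough sketch: the `headroom` parameter (the share of each GPU reserved for KV cache and runtime overhead) is an illustrative assumption, not a measured value.

```python
import math


def gpus_needed(model_gb: float, gpu_gb: float, headroom: float = 0.10) -> int:
    """Minimum number of identical GPUs whose combined usable memory holds the weights.

    headroom is the fraction of each GPU kept free for KV cache and runtime
    overhead; the 10% default is an assumption chosen for illustration.
    """
    usable_per_gpu = gpu_gb * (1.0 - headroom)
    return math.ceil(model_gb / usable_per_gpu)


if __name__ == "__main__":
    # Maverick at Q4 (~224GB of weights, from the VRAM table)
    print("80GB cards:", gpus_needed(224, 80))
    print("192GB cards:", gpus_needed(224, 192))
```

With these assumptions the result is 4 cards at 80GB and 2 cards at 192GB, in line with the list above. Some runtimes impose their own constraints on GPU count, so treat the result as a lower bound.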

When to Use Llama 3 Instead

For most consumer hardware, the Llama 3 family remains more practical:

Your VRAM	Best Llama Option
8 GB	Llama 3.2 3B
12 GB	Llama 3.1 8B at Q6+
24 GB	Llama 3.1 8B at Q8 (fast)
32 GB	Llama 3.3 70B at Q3 with offload
64 GB+	Llama 3.3 70B at Q5+ or Llama 4 Scout at Q3
128 GB+	Llama 4 Scout at Q4+

Understanding MoE Architecture

Llama 4's Mixture of Experts design works differently from Llama 3's dense models:

Dense models (Llama 3): Every parameter is used for every token. A 70B model uses 70B parameters per forward pass.
MoE models (Llama 4): Only a subset of "expert" layers activate per token. Scout has 16 experts but routes to 1, so only 17B parameters compute each token.

The trade-off: MoE models deliver higher quality per compute cycle, but ALL expert weights must reside in memory. You get 70B-class quality from 17B active parameters, but need 109B-worth of VRAM.
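
To make the memory-versus-compute trade-off concrete, here is a toy top-1 router in plain Python. It is an illustration only, not Llama 4's actual routing code: the class name, dimensions, and random weights are all hypothetical. It shows that every expert is stored, while only one is used per token.

```python
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class ToyMoELayer:
    """Toy MoE layer: each expert is a dim x dim matrix, routing is top-1."""

    def __init__(self, num_experts: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        # Every expert's weights are allocated up front (all resident in memory).
        self.experts = [
            [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
            for _ in range(num_experts)
        ]
        # The router produces one score per expert for a given token vector.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]

    def resident_params(self) -> int:
        """Expert parameters that must sit in memory (router excluded)."""
        return len(self.experts) * self.dim * self.dim

    def active_params(self) -> int:
        """Expert parameters actually used to compute one token."""
        return self.dim * self.dim

    def forward(self, x):
        scores = matvec(self.router, x)
        chosen = max(range(len(scores)), key=scores.__getitem__)  # top-1 routing
        return chosen, matvec(self.experts[chosen], x)


if __name__ == "__main__":
    layer = ToyMoELayer(num_experts=16, dim=8)
    expert_id, _ = layer.forward([1.0] * 8)
    print("routed to expert", expert_id)
    print("resident:", layer.resident_params(), "active:", layer.active_params())
```

In this toy layer the resident parameter count is 16 times the active count, which is the same pattern that makes Scout need memory for 109B parameters while computing with 17B.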

For more about how this affects quantization, read our quantization guide.

Quantization Recommendations

Since Llama 4 models are already large, every bit of compression helps:

Plenty of memory (>1.5× model size): Q5_K_M — great quality
Tight fit: Q4_K_M — the standard choice
Need to squeeze: Q3_K_M — noticeable quality drop but functional
Emergency: Q2_K — significant degradation, last resort
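
The tiers above amount to picking the largest quantization that still fits. A minimal sketch, using the weight sizes from this article's VRAM table; the `reserve_gb` allowance for KV cache and overhead is an assumption, and Q2_K is omitted because the table gives no size for it:

```python
# Weight sizes in GB, copied from the VRAM table in this article.
WEIGHT_GB = {
    "scout": {"Q3_K_M": 48, "Q4_K_M": 61, "Q5_K_M": 75, "Q6_K": 88, "Q8_0": 116},
    "maverick": {"Q3_K_M": 176, "Q4_K_M": 224, "Q5_K_M": 276, "Q6_K": 324, "Q8_0": 424},
}


def best_quant(model: str, usable_gb: float, reserve_gb: float = 8.0):
    """Return the largest quantization whose weights plus a reserve fit, or None.

    reserve_gb is an assumed allowance for KV cache and runtime overhead.
    """
    fits = [(gb, quant) for quant, gb in WEIGHT_GB[model].items()
            if gb + reserve_gb <= usable_gb]
    return max(fits)[1] if fits else None


if __name__ == "__main__":
    print(best_quant("scout", 80))       # single 80GB GPU
    print(best_quant("scout", 48))       # 2x 24GB GPUs
    print(best_quant("maverick", 320))   # 4x 80GB GPUs
```

A result of `None` means nothing in the table fits and only the Q2_K last resort remains.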

MoE models are somewhat more tolerant of quantization because the routing mechanism is less sensitive to precision than the expert weights themselves.

Performance Expectations

Despite the large total parameter count, Llama 4 Scout generates tokens at speeds comparable to a dense 17B model (since only 17B parameters are active):

Hardware	Scout (Q4)	Notes
Mac M4 Max 128GB	~15-20 tok/s	Unified memory bandwidth limited
A100 80GB	~35-45 tok/s	High bandwidth, fits at Q4
H100 80GB	~50-65 tok/s	Best single-GPU option
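
For a rough upper bound on these figures, single-stream decoding is often modeled as memory-bandwidth bound: each token requires reading the active weights once. The sketch below uses that simplification. The 0.56 GB per billion parameters factor is back-derived from this article's Q4_K_M numbers, and the 500 GB/s bandwidth in the example is a hypothetical value, not a spec for any device listed.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float = 17.0,
                         gb_per_billion: float = 0.56) -> float:
    """Theoretical tokens/s if every token reads the active weights exactly once.

    Ignores KV cache reads, routing overhead, and kernel efficiency, so real
    throughput will be lower than this ceiling.
    """
    gb_read_per_token = active_params_b * gb_per_billion
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # 500 GB/s is a hypothetical bandwidth chosen for illustration.
    print(round(decode_ceiling_tok_s(500.0), 1), "tok/s ceiling")
```

Measured speeds such as those in the table sit well below this kind of ceiling, so use it only to compare hardware relative to each other.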

Getting Started

Check your hardware — see if Scout fits
If Scout doesn't fit, Llama 3.1 8B or Llama 3.3 70B are excellent alternatives
Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Run: ollama run llama4:scout
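
Once the model is running, it can also be called from code through Ollama's local REST API. A minimal sketch using only the standard library, assuming the server is on its default port; the default model tag here is an assumption, so check `ollama list` for the tag installed on your machine:

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama4:scout") -> urllib.request.Request:
    """Build a non-streaming generate request (model tag is an assumption)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama4:scout") -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize MoE in one sentence.")` requires the Ollama server to be running and the model already pulled; otherwise the request fails with a connection or HTTP error.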