Will It Run AI

Llama 4 VRAM Requirements — Scout 109B & Maverick 400B (GPU & Mac Guide)

Llama 4 Scout needs ~61GB at Q4, Maverick needs ~224GB. Complete VRAM tables for every quantization level with GPU and Mac hardware recommendations.

Meta's Llama 4 represents a major architectural shift from the dense Llama 3 family to a Mixture of Experts (MoE) design. While this delivers better quality per compute, it also means larger total model sizes that challenge local hardware.

Llama 4 Model Family

| Model | Total Params | Active Params | Architecture | Experts |
|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | MoE | 16 experts, 1 active |
| Llama 4 Maverick | 400B | 17B | MoE | 128 experts, 1 active |
| Llama 4 Behemoth | ~2T | 288B | MoE | 16 experts; unreleased |

Scout and Maverick share the same 17B active parameter count, so their per-token compute is similar and token generation speed is comparable once the weights fit in memory. The difference is capacity: Maverick's larger pool of experts can hold more specialized knowledge.

VRAM Requirements

| Model | Q3_K_M | Q4_K_M | Q5_K_M | Q6_K | Q8_0 |
|---|---|---|---|---|---|
| Llama 4 Scout (109B) | 48 GB | 61 GB | 75 GB | 88 GB | 116 GB |
| Llama 4 Maverick (400B) | 176 GB | 224 GB | 276 GB | 324 GB | 424 GB |

Important: MoE models must load ALL parameters into memory, even though only 17B are active per token. This is why the VRAM requirements are much higher than a 17B dense model.
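
The figures in the VRAM table are consistent with multiplying the total parameter count by a rough bytes-per-parameter factor for each quantization level. The sketch below reproduces them. The factors are back-derived from this guide's table rather than taken from any official specification, the function name is hypothetical, and runtime overhead such as the KV cache is not included.

```python
# Approximate bytes per parameter for each quantization level.
# ASSUMPTION: these factors are inferred from the VRAM table in this guide,
# not from an official GGUF specification.
BYTES_PER_PARAM = {
    "Q3_K_M": 0.44,
    "Q4_K_M": 0.56,
    "Q5_K_M": 0.69,
    "Q6_K": 0.81,
    "Q8_0": 1.06,
}

def estimate_weights_gb(total_params_billion, quant):
    """Estimate memory for the weights alone, in GB (no KV cache or buffers)."""
    return total_params_billion * BYTES_PER_PARAM[quant]

# Use TOTAL parameters (109B, 400B), not the 17B active parameters.
for name, params in [("Scout", 109), ("Maverick", 400)]:
    row = ", ".join(
        f"{q}: {estimate_weights_gb(params, q):.0f} GB" for q in BYTES_PER_PARAM
    )
    print(f"{name}: {row}")
```

Passing 17 instead of 109 would give the footprint of a dense 17B model, which is the mistake the note above warns against.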

Hardware Recommendations

Llama 4 Scout (109B MoE)

Scout is the accessible Llama 4 variant, though "accessible" still means high-end hardware.

Consumer options:

  • Mac M4 Max 128GB: fits at Q3-Q4 in unified memory
  • 2× RTX 4090 24GB with tensor parallelism (48GB total, needs Q2)

Datacenter options:

  • A100 80GB: fits at Q4 (barely) on a single card
  • H100 80GB: same fit, with higher memory bandwidth

Quick start:

ollama run llama4:scout

Llama 4 Maverick (400B MoE)

Maverick requires serious hardware — even at Q4 it needs 224GB.

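Recommended hardware: a multi-GPU server, or a Mac with 192GB or more of unified memory combined with heavy quantization.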
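
As a rough illustration of what "multi-GPU" means here, the sketch below counts how many identical GPUs are needed to hold a given weight size. The function name and the 0.9 usable fraction (a reserve for KV cache and runtime buffers) are assumptions made for this example, not measured values.

```python
import math

def gpus_needed(model_gb, gpu_gb, usable_fraction=0.9):
    """Smallest number of identical GPUs whose usable memory covers the weights.

    usable_fraction is an ASSUMED reserve for KV cache and runtime buffers.
    """
    usable_per_gpu = gpu_gb * usable_fraction
    return math.ceil(model_gb / usable_per_gpu)

print(gpus_needed(224, 80))  # Maverick at Q4_K_M on 80GB cards
print(gpus_needed(424, 80))  # Maverick at Q8_0 on 80GB cards
```

Under these assumptions Maverick at Q4 needs four 80GB cards and Q8 needs six. Real deployments may need more, depending on context length and how the serving framework splits the model.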

When to Use Llama 3 Instead

For most consumer hardware, the Llama 3 family remains more practical:

| Your VRAM | Best Llama Option |
|---|---|
| 8 GB | Llama 3.2 3B |
| 12 GB | Llama 3.1 8B at Q6+ |
| 24 GB | Llama 3.1 8B at Q8 (fast) |
| 32 GB | Llama 3.3 70B at Q3 with offload |
| 64 GB+ | Llama 3.3 70B at Q5+ or Llama 4 Scout at Q3 |
| 128 GB+ | Llama 4 Scout at Q4+ |

Understanding MoE Architecture

Llama 4's Mixture of Experts design works differently from Llama 3's dense models:

  • Dense models (Llama 3): Every parameter is used for every token. A 70B model uses 70B parameters per forward pass.
  • MoE models (Llama 4): Only a subset of "expert" layers activate per token. Scout has 16 experts but routes to 1, so only 17B parameters compute each token.

The trade-off: MoE models deliver higher quality per compute cycle, but ALL expert weights must reside in memory. You get 70B-class quality from 17B active parameters, but need 109B-worth of VRAM.
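
The routing idea can be sketched with a toy top-1 router. This is an illustration of the general technique only, not Meta's implementation: the names, the dot-product scoring, and the trivial "experts" are all assumptions chosen to keep the example small.

```python
import random

def top1_route(token, gate_weights):
    """Return the index of the expert with the highest router score."""
    scores = [sum(t * w for t, w in zip(token, gate)) for gate in gate_weights]
    return max(range(len(scores)), key=scores.__getitem__)

def moe_layer(token, gate_weights, experts):
    """Run only the selected expert; the rest stay idle but remain in memory."""
    idx = top1_route(token, gate_weights)
    return idx, experts[idx](token)

# Toy setup: 16 experts (the same count as Scout), each scaling its input
# by a different constant. Real experts are feed-forward networks.
rng = random.Random(0)
DIM, NUM_EXPERTS = 4, 16
gate_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
chosen, output = moe_layer(token, gate_weights, experts)
print(f"expert {chosen} of {NUM_EXPERTS} handled this token")
```

All 16 entries of `experts` have to exist before any token arrives, because the router can pick any of them. That is the memory cost described above: compute scales with one expert, storage scales with all of them.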

For more about how this affects quantization, read our quantization guide.

Quantization Recommendations

Since Llama 4 models are already large, every bit of compression helps:

  • Plenty of memory (>1.5× model size): Q5_K_M — great quality
  • Tight fit: Q4_K_M — the standard choice
  • Need to squeeze: Q3_K_M — noticeable quality drop but functional
  • Emergency: Q2_K — significant degradation, last resort

MoE models are often described as somewhat more tolerant of quantization than dense models, though the effect varies by model and quantization method, so test output quality before settling on a level.
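
One way to apply these recommendations is to pick the largest quantization that still fits with some headroom. The sketch below uses the sizes from this guide's VRAM table. The function name and the 1.1× headroom factor are assumptions for illustration; longer contexts need more headroom.

```python
# Weight sizes in GB, copied from the VRAM table in this guide.
SIZES_GB = {
    "Scout": {"Q3_K_M": 48, "Q4_K_M": 61, "Q5_K_M": 75, "Q6_K": 88, "Q8_0": 116},
    "Maverick": {"Q3_K_M": 176, "Q4_K_M": 224, "Q5_K_M": 276, "Q6_K": 324, "Q8_0": 424},
}

def best_quant(model, available_gb, headroom=1.1):
    """Return the largest listed quantization that fits, or None.

    headroom is an ASSUMED multiplier covering KV cache and runtime buffers.
    """
    fitting = [
        (size, quant)
        for quant, size in SIZES_GB[model].items()
        if size * headroom <= available_gb
    ]
    return max(fitting)[1] if fitting else None

print(best_quant("Scout", 80))      # single 80GB card
print(best_quant("Scout", 48))      # two 24GB cards
print(best_quant("Maverick", 256))
```

A result of `None` means nothing in the table fits, which is the cue to consider Q2_K or a smaller model.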

Performance Expectations

Despite the large total parameter count, Llama 4 Scout generates tokens at speeds comparable to a dense 17B model (since only 17B parameters are active):

| Hardware | Scout (Q4) | Notes |
|---|---|---|
| Mac M4 Max 128GB | ~15-20 tok/s | Unified memory bandwidth limited |
| A100 80GB | ~35-45 tok/s | High bandwidth, fits at Q4 |
| H100 80GB | ~50-65 tok/s | Best single-GPU option |
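
Token generation is usually limited by how fast the active weights can be read from memory, which gives a simple theoretical ceiling. The sketch below assumes every token reads all 17B active parameters once at about 0.56 bytes each (the Q4 factor implied by 61 GB / 109B in the VRAM table). The 500 GB/s bandwidth is an illustrative input, not the specification of any device listed above, and the function name is hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion=17, bytes_per_param=0.56):
    """Upper bound on tokens/s if each token reads all active weights once.

    ASSUMPTION: decode is purely memory-bandwidth bound; compute, KV-cache
    reads, and framework overhead are ignored, so real speeds are lower.
    """
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

print(f"{decode_ceiling_tok_s(500):.1f} tok/s ceiling at 500 GB/s")
```

Treat the result as a ceiling rather than a prediction. Measured figures like those in the table sit below it.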

Getting Started

  1. Check your hardware — see if Scout fits
  2. If Scout doesn't fit, Llama 3.1 8B or Llama 3.3 70B are excellent alternatives
  3. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  4. Run: ollama run llama4:scout

Frequently Asked Questions

How much VRAM does Llama 4 need?

Llama 4 Scout (17B active, 109B total MoE) needs ~61GB at Q4_K_M. Llama 4 Maverick (17B active, 400B total) needs ~224GB at Q4_K_M. Scout fits on a Mac M4 Max 128GB or a single 80GB datacenter GPU. Maverick requires multi-GPU setups.

Can I run Llama 4 on an RTX 4090?

Not on a single card. Llama 4 Scout needs ~61GB at Q4, far more than the RTX 4090's 24GB. Aggressive quantization (Q2) combined with heavy CPU offloading makes it technically possible, but generation is very slow. Consider Llama 3.3 70B or Llama 3.1 8B instead.

What's the difference between Llama 4 Scout and Maverick?

Both are Mixture of Experts models with 17B active parameters. Scout has 109B total parameters (16 experts, 1 active), while Maverick has 400B total (128 experts, 1 active). Maverick is more capable but requires significantly more memory.

Is Llama 4 better than Llama 3?

Llama 4 uses a MoE architecture unlike Llama 3's dense models. This means higher quality per active parameter but larger total model size. For most consumer hardware, Llama 3.1 8B or Llama 3.3 70B remain more practical choices.

Can I run Llama 4 on a Mac?

A Mac M4 Max with 128GB can run Llama 4 Scout at Q3-Q4 using unified memory. A Mac with 192GB of unified memory (such as an M2 Ultra) handles Scout comfortably at Q4+. Maverick requires 192GB+ and heavy quantization.