Qwen 3.5 VRAM Requirements — 9B, 27B, 35B-A3B, 122B-A10B
Exact VRAM for Qwen 3.5 9B, 27B, 35B-A3B, and 122B-A10B at Q4, Q5, Q6, Q8, and FP16. Includes RTX 4090, Apple Silicon, and MLX vs GGUF guidance.
If you are searching for Qwen 3.5 9B, 27B, 35B-A3B, or 122B-A10B VRAM requirements, this is the dedicated guide.
Dedicated variant pages (deeper dive with per-GPU matrices and setup commands):
- Qwen 3.5 9B VRAM Requirements — best 8B-class, 5.5 GB at Q4
- Qwen 3.5 27B VRAM Requirements — dense flagship, 16.5 GB at Q4
- Qwen 3.5 35B-A3B VRAM Requirements — MoE, 21.4 GB at Q4
- Qwen 3.5 122B-A10B VRAM Requirements — workstation MoE, 74.4 GB at Q4
Quick answers for Qwen 3.5
- Qwen 3.5 9B: ~5.5 GB at Q4_K_M, ~9.6 GB at Q8_0
- Qwen 3.5 27B: ~16.5 GB at Q4_K_M, ~28.9 GB at Q8_0
- Qwen 3.5 35B-A3B: ~21.4 GB at Q4_K_M, ~37.5 GB at Q8_0
- Qwen 3.5 122B-A10B: ~74.4 GB at Q4_K_M, ~130.5 GB at Q8_0
Qwen 3.5 is the latest generation of open-weight models from Alibaba Cloud. Building on the strong foundation of Qwen 3, the 3.5 family pushes quality and efficiency further with improved training data, better instruction following, and refined Mixture of Experts (MoE) routing. The lineup spans from a compact 4B dense model to the massive 397B-A17B MoE flagship, covering every hardware tier from budget laptops to multi-GPU data center rigs.
This guide provides exact VRAM requirements for every Qwen 3.5 variant at all major quantization levels, along with hardware recommendations and Apple Silicon guidance.
If you need the original Qwen 3 lineup instead, read Qwen 3 VRAM Requirements.
Qwen 3.5 Model Family Overview
Alibaba's Qwen 3.5 release includes both dense and MoE architectures:
| Model | Type | Total Params | Active Params | Best For |
|---|---|---|---|---|
| Qwen3.5-4B | Dense | 4B | 4B | Edge devices, lightweight assistants |
| Qwen3.5-9B | Dense | 9B | 9B | Strong all-rounder, great value |
| Qwen3.5-27B | Dense | 27B | 27B | High-quality dense reasoning |
| Qwen3.5-35B-A3B | MoE | 35B | 3B | Best single-GPU efficiency |
| Qwen3.5-122B-A10B | MoE | 122B | 10B | Professional workstation, Mac Pro |
| Qwen3.5-397B-A17B | MoE | 397B | 17B | Frontier-class, multi-GPU only |
The MoE variants are the standout story. Qwen 3.5 35B-A3B activates only 3B parameters per token despite having 35B total, giving you dense-27B-class quality at 3B-class inference speed. The 122B-A10B and 397B-A17B follow the same principle at larger scale.
VRAM Requirements by Model and Quantization
These numbers represent model weight sizes calculated from calibrated per-parameter rates against real GGUF files. Add 1-2 GB for KV cache and runtime overhead at default context lengths.
| Model | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| Qwen3.5-4B | 2.4 GB | 2.9 GB | 3.3 GB | 4.3 GB | 8.2 GB |
| Qwen3.5-9B | 5.5 GB | 6.5 GB | 7.4 GB | 9.6 GB | 18.5 GB |
| Qwen3.5-27B | 16.5 GB | 19.4 GB | 22.1 GB | 28.9 GB | 55.4 GB |
| Qwen3.5-35B-A3B (MoE) | 21.4 GB | 25.2 GB | 28.7 GB | 37.5 GB | 71.8 GB |
| Qwen3.5-122B-A10B (MoE) | 74.4 GB | 87.8 GB | 100.0 GB | 130.5 GB | 250.1 GB |
| Qwen3.5-397B-A17B (MoE) | 242.2 GB | 285.8 GB | 325.5 GB | 424.8 GB | 813.9 GB |
Key takeaway: The 35B-A3B MoE needs 21.4 GB at Q4 — it fits on a 24 GB RTX 4090 with room for a reasonable context window. Despite loading 35B parameters into memory, inference runs at the speed of a 3B dense model because only 3B parameters are active per token.
Best GPU for Each Qwen 3.5 Model
Qwen3.5-4B — Runs Almost Anywhere
At 2.4 GB (Q4), the 4B fits on virtually any modern GPU. Even a GTX 1060 6GB or Intel Arc A380 handles it at Q8. Great for always-on assistants, edge deployments, or quick prototyping.
Recommended hardware:
- Any GPU with 4 GB or more VRAM (RTX 3060, RTX 4060, GTX 1060)
- Mac M-series with 8 GB unified memory
- CPU-only inference with 8 GB or more system RAM
Quick start:
ollama run qwen3.5:4b
Qwen3.5-9B — Best Under 12GB
The 9B dense model is the sweet spot of the Qwen 3.5 lineup. It delivers strong instruction-following, multilingual, and coding performance while staying under 10 GB at Q8.
Recommended hardware:
- RTX 4060 8GB — fits at Q5_K_M with moderate headroom
- RTX 4070 12GB — comfortable at Q6, excellent performance
- RTX 3060 12GB — fits at Q6 with careful context limits
Quick start:
ollama run qwen3.5:9b
Qwen3.5-27B — The 24GB GPU Sweet Spot
The 27B dense model is the largest Qwen 3.5 that fits on a single consumer GPU at Q4. At 16.5 GB it leaves about 7 GB of headroom on a 24 GB card — enough for standard context lengths. Quality is a clear step up from the 9B, particularly for complex reasoning and long-form generation.
Recommended hardware:
- RTX 4090 24GB — fits at Q4 with good headroom
- RTX 5090 32GB — comfortable at Q5, room for large context
- RTX 3090 24GB — fits at Q4, slower but functional
Quick start:
ollama run qwen3.5:27b
Check compatibility: Can Qwen 3.5 27B run on RTX 4090?
Qwen3.5-35B-A3B (MoE) — Best Efficiency on 24GB
This is the model to watch. Despite having 35B total parameters, the MoE architecture activates only 3B per token. The result: you load 21.4 GB into VRAM at Q4, but inference is as fast as a 3B dense model while quality approaches that of a much larger dense network.
At Q4, it fits on a 24 GB GPU. Inference speed is excellent because the active compute footprint is tiny.
Recommended hardware:
- RTX 4090 24GB — fits at Q4 with ~2.5 GB headroom
- RTX 5090 32GB — fits at Q5 with generous context window
- Mac M4 Max 36GB — comfortable at Q4, unified memory advantage
Quick start:
ollama run qwen3.5:35b-a3b
Qwen3.5-122B-A10B (MoE) — Workstation and Apple Silicon Territory
With 74.4 GB at Q4, the 122B-A10B requires either professional GPUs or high-memory Apple Silicon. The payoff is substantial: 10B active parameters deliver quality that competes with frontier dense models, while inference speed remains practical.
Recommended hardware:
- A100 80GB — fits at Q4 on a single GPU
- H100 80GB — fits at Q4 with fast inference
- Mac M4 Max 128GB — fits at Q4 with room to spare
- Mac M4 Ultra 192GB — comfortable at Q5
Qwen3.5-397B-A17B (MoE) — Multi-GPU or Ultra-High-Memory Mac
The flagship model needs 242 GB at Q4. This is firmly in multi-GPU or Apple Silicon Ultra territory. With 17B active parameters per token, the 397B-A17B delivers frontier-class quality while remaining feasible on high-end local hardware.
Recommended hardware:
- H100 80GB x 4 GPUs — fits at Q4 with tensor parallelism
- A100 80GB x 4 GPUs — fits at Q4
- MI300X 192GB x 2 GPUs — fits at Q4
- Apple M3 Ultra 512GB (custom config) — fits at Q4 with unified memory
Apple Silicon Guide — Qwen 3.5 on Mac
Apple Silicon's unified memory architecture makes it one of the best platforms for running large models locally. Here is what each Mac configuration can handle:
| Mac Configuration | Best Qwen 3.5 Model | Quantization | Notes |
|---|---|---|---|
| M4 16GB | Qwen3.5-4B | Q8_0 (4.3 GB) | Comfortable with headroom |
| M4 16GB | Qwen3.5-9B | Q4_K_M (5.5 GB) | Tight — leaves ~10 GB for OS and context |
| M4 Pro 24GB | Qwen3.5-9B | Q8_0 (9.6 GB) | Comfortable fit |
| M4 Pro 24GB | Qwen3.5-27B | Q4_K_M (16.5 GB) | Tight but functional |
| M4 Max 64GB | Qwen3.5-27B | F16 (55.4 GB) | Full precision, excellent quality |
| M4 Max 64GB | Qwen3.5-35B-A3B | Q4_K_M (21.4 GB) | Comfortable with large context |
| M4 Max 128GB | Qwen3.5-122B-A10B | Q4_K_M (74.4 GB) | Fits with headroom for context |
| M4 Ultra 192GB | Qwen3.5-122B-A10B | Q5_K_M (87.8 GB) | Generous headroom |
MLX vs GGUF on Mac
Mac users have two main options for running Qwen 3.5 locally:
- MLX (via
mlx-lmor LM Studio): Apple's native ML framework. Optimized for Apple Silicon with minimal overhead. Best performance and memory efficiency on Mac. Look formlx-community/Qwen3.5-*models on Hugging Face. - GGUF (via llama.cpp or Ollama): Cross-platform format. Slightly higher memory overhead on Mac than MLX but broader compatibility. Works well through Ollama's simple CLI.
For most Mac users, MLX is the better choice — it takes full advantage of the unified memory architecture and Metal GPU acceleration. GGUF via Ollama is the easier setup if you just want to get started quickly.
Explore Mac compatibility for all models on our Mac hardware pages.
Choosing the Right Quantization
Qwen 3.5 handles quantization well across the board. Here is a quick guide:
| VRAM Budget | Recommended Quant | Quality Impact |
|---|---|---|
| Very tight (4 GB or less) | Q4_K_M | Minimal loss for chat and general tasks |
| Normal (4-12 GB) | Q5_K_M | Good balance of quality and size |
| Comfortable (12-24 GB) | Q6_K | Near-lossless for most tasks |
| Generous (24 GB and above) | Q8_0 | Effectively identical to full precision |
For coding and structured output tasks, prefer Q5_K_M or higher. For casual conversation and summarization, Q4_K_M is perfectly adequate.
Read our quantization guide for a deeper look at how quant levels affect output quality.
Getting Started
- Auto-detect your hardware: Use Hardware Detection to see which Qwen 3.5 variant fits your GPU or Mac automatically
- Check your fit: Use the VRAM calculator to see which Qwen 3.5 variant matches your hardware
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh - Run your model:
# Fast and light ollama run qwen3.5:9b # Best efficiency on 24GB GPU (MoE) ollama run qwen3.5:35b-a3b # High-quality dense ollama run qwen3.5:27b - Browse all Qwen models on WillItRunAI: Browse Qwen models