Qwen 3 GPU Requirements — Original Family (0.6B–235B) VRAM Guide (2026)
VRAM tables for the original Qwen 3 family (0.6B to 235B-A22B), with GPU and Mac recommendations. For the newer Qwen 3.5 and Qwen 3.6 generations, see the dedicated pages linked below.
This page covers the original Qwen 3 family (released 2025): dense and MoE models from 0.6B up to 235B-A22B. The Qwen 3.5 refresh appears only in the jump table below, which links to its dedicated guides. The newest Qwen 3.6 generation also has its own guides, and this page no longer covers 3.6 specs.
Running Qwen 3.6? It's a separate generation with its own VRAM math. Go straight to the dedicated guides: Qwen 3.6 27B (dense) → and Qwen 3.6 35B-A3B (MoE) →.
Skip to your size — Qwen 3 / 3.5 jump table
| If you searched for… | Q4_K_M VRAM | Dedicated guide |
|---|---|---|
| qwen 3.5 9b | ~5.7 GB | Qwen 3.5 9B VRAM |
| qwen 3.5 27b | ~16.5 GB | Qwen 3.5 27B VRAM |
| qwen 3.5 35b-a3b | ~19.6 GB | Qwen 3.5 35B-A3B VRAM |
| qwen 3.5 122b-a10b | ~74 GB | Qwen 3.5 122B-A10B VRAM |
| qwen 3 8B / 14B / 30B-A3B / 32B / 235B-A22B (original) | see tables below | this page |
Quick answers for the original Qwen 3
- Qwen 3 8B: ~4.6 GB at Q4_K_M, ~8.5 GB at Q8_0
- Qwen 3 14B: ~8.3 GB at Q4_K_M, ~15.7 GB at Q8_0
- Qwen 3 30B-A3B: ~16.8 GB at Q4_K_M, ~31.6 GB at Q8_0
- Qwen 3 32B: ~19.1 GB at Q4_K_M, ~36.1 GB at Q8_0
- Qwen 3 235B-A22B: ~132 GB at Q4_K_M
Don't see your variant above? Try the VRAM Calculator — paste any model + GPU/Mac and get an exact fit verdict in seconds.
Qwen 3 is Alibaba's open-weight foundation family. It competes with Llama 4 and DeepSeek V3 and spans a wide range of sizes, from compact 0.6B models that run on a phone to the flagship 235B MoE. The lineup combines dense models (predictable VRAM and speed) with Mixture of Experts (MoE) models, which deliver more quality per unit of compute at the cost of a larger memory footprint.
Qwen 3 Model Family
Alibaba released Qwen 3 as a complete lineup covering every hardware tier:
| Model | Type | Parameters | Active Params | Best For |
|---|---|---|---|---|
| Qwen 3 0.6B | Dense | 0.6B | 0.6B | Edge devices, always-on agents |
| Qwen 3 1.7B | Dense | 1.7B | 1.7B | Lightweight local assistants |
| Qwen 3 4B | Dense | 4B | 4B | Mid-range phones, low-VRAM desktops |
| Qwen 3 8B | Dense | 8B | 8B | Flagship small model, great all-rounder |
| Qwen 3 14B | Dense | 14B | 14B | Mid-range performance, strong reasoning |
| Qwen 3 30B-A3B | MoE | 30B | 3B | Best efficiency, single-GPU MoE |
| Qwen 3 32B | Dense | 32B | 32B | High-end dense, maximum dense quality |
| Qwen 3 235B-A22B | MoE | 235B | 22B | Flagship MoE, frontier-class quality |
| Qwen 3 Coder 8B | Dense | 8B | 8B | Coding-optimized small model |
| Qwen 3 Coder 14B | Dense | 14B | 14B | Coding mid-range |
| Qwen 3 Coder 30B-A3B | MoE | 30B | 3B | Best coding efficiency |
The Coder variants share architecture with the base models but are fine-tuned specifically on programming tasks for better results on code generation, debugging, and technical documentation.
VRAM Requirements by Variant
Approximate VRAM needed for the model weights at each quantization level, for the original Qwen 3 family:
| Variant | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|
| Qwen 3 0.6B | 0.5 GB | 0.6 GB | 0.7 GB | 0.9 GB | 1.3 GB |
| Qwen 3 1.7B | 1.1 GB | 1.3 GB | 1.5 GB | 1.9 GB | 3.4 GB |
| Qwen 3 4B | 2.5 GB | 3.0 GB | 3.6 GB | 4.6 GB | 8.0 GB |
| Qwen 3 8B | 4.6 GB | 5.6 GB | 6.6 GB | 8.5 GB | 16.1 GB |
| Qwen 3 14B | 8.3 GB | 10.2 GB | 12.0 GB | 15.7 GB | 28.0 GB |
| Qwen 3 30B-A3B | 16.8 GB | 20.6 GB | 24.2 GB | 31.6 GB | 60.0 GB |
| Qwen 3 32B | 19.1 GB | 23.5 GB | 27.6 GB | 36.1 GB | 64.4 GB |
| Qwen 3 235B-A22B | 131.9 GB | 162.2 GB | 190.5 GB | 249.0 GB | 470.0 GB |
Add ~1-2 GB for KV cache and runtime overhead at default context lengths.
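The fit arithmetic above (weights from the table plus a KV cache and runtime allowance) can be sketched in a few lines. This is a minimal illustration, not the VRAM calculator's actual logic: the function names are hypothetical, and the 2 GB default is simply the upper end of the 1-2 GB allowance mentioned above.

```python
# Weight sizes in GB for Qwen 3 8B, copied from the table above.
QWEN3_8B_GB = {"Q4_K_M": 4.6, "Q5_K_M": 5.6, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.1}


def headroom_gb(weights_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> float:
    """VRAM left after loading the weights plus an assumed overhead allowance."""
    return vram_gb - (weights_gb + overhead_gb)


def fits(weights_gb: float, vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True when weights plus the overhead allowance fit in the given VRAM."""
    return headroom_gb(weights_gb, vram_gb, overhead_gb) >= 0


# Check every quant level of the 8B model against a 12 GB card.
for quant, size_gb in QWEN3_8B_GB.items():
    verdict = "fits" if fits(size_gb, 12.0) else "does not fit"
    print(f"Qwen 3 8B {quant} on a 12 GB card: {verdict}")
```

Long contexts can need far more than 2 GB of KV cache, so treat a positive result as a starting point rather than a guarantee.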
Hardware Recommendations
Qwen 3 0.6B and 1.7B — Anywhere and Everywhere
These micro-models are designed for constrained environments. At Q8_0 or below they fit in less than 2 GB of VRAM, making them viable on integrated graphics, older GPUs, or even CPU-only inference.
Recommended hardware:
- Any GPU with 4GB+ VRAM (even GTX 1650 4GB)
- Mac M-series with 8GB unified memory
- Modern CPUs with 8GB+ RAM (llama.cpp CPU mode)
Quick start:
ollama run qwen3:0.6b
ollama run qwen3:1.7b
Qwen 3 4B — Respectable Quality, Minimal Hardware
The 4B model delivers surprisingly capable responses for its size. At Q4, it needs only ~2.5 GB — perfect for 4-6 GB GPUs or low-memory Macs.
Recommended hardware:
- RTX 3050 6GB / RTX 4060 8GB — fits at Q8 with headroom
- Mac M2/M3 with 8GB unified memory
- Intel Arc A770 16GB — excellent efficiency
Quick start:
ollama run qwen3:4b
Qwen 3 8B — Best Small Model
The 8B is the sweet spot of the Qwen 3 dense lineup. It punches well above its weight class on instruction following, coding, and multilingual tasks — particularly strong on Chinese and Japanese.
Recommended hardware:
- RTX 4060 8GB: fits at Q4_K_M (~4.6 GB) with room for context; Q6 (~6.6 GB) is possible with careful context limits
- RTX 4070 12GB — comfortable at Q6, excellent performance-per-watt
- RTX 4070 Ti Super 16GB — fits at Q8 with room for large context
- Any Mac with 16GB+ unified memory
Quick start:
ollama run qwen3:8b
Check compatibility: Qwen 3 8B on RTX 4070 | Qwen 3 8B on RTX 4060
Qwen 3 14B — Best Under 16GB VRAM
The 14B hits a quality tier noticeably above the 8B, especially for complex reasoning and coding tasks. At Q4 it needs ~8.3 GB, making it accessible on most mainstream GPUs.
Recommended hardware:
- RTX 4070 12GB — fits at Q4_K_M; tight but functional
- RTX 4070 Ti Super 16GB — comfortable at Q5, strong throughput
- RTX 4080 Super 16GB: excellent at Q6 (Q8_0 at ~15.7 GB leaves no room for KV cache)
- Mac M4 Pro 24GB — fits Q8 comfortably, unified memory advantage
Quick start:
ollama run qwen3:14b
Check compatibility: Qwen 3 14B on RTX 4070 Ti Super
Qwen 3 30B-A3B — The MoE Efficiency Champion
This is one of the most interesting models in the Qwen 3 family. The 30B-A3B is a Mixture of Experts model with 30B total parameters but only 3B active per token. Per-token compute is therefore close to that of a 3B dense model, while the quality rivals much larger dense models.
At Q4, the 30B-A3B needs ~17 GB — fitting comfortably on a 24GB GPU.
Recommended hardware:
- RTX 4090 24GB — perfect fit at Q4, fast inference
- RTX 5090 32GB — Q5+ with plenty of context headroom
- Mac M4 Max 36GB — comfortable at Q5, excellent efficiency
- Mac M4 Pro 24GB — fits at Q4 with good performance
Quick start:
ollama run qwen3:30b-a3b
Check compatibility: Qwen 3 30B-A3B on RTX 4090 | Qwen 3 30B-A3B on RTX 5090
Qwen 3 32B — Maximum Dense Quality
The dense 32B is the largest Qwen 3 model that doesn't use MoE. It delivers the highest quality dense inference in the family. At Q4 it needs ~19 GB, which leaves only about 5 GB on a 24GB GPU for KV cache and runtime overhead, so long-context workloads will require some context length management.
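To see why context length matters on a tight card, the KV cache can be estimated from the model's attention shape. The sketch below uses the standard formula (keys and values per layer, per KV head, per token). The example numbers (64 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions that do not come from this page, so read the real values from the model's config.json before relying on the result.

```python
def kv_cache_gib(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes assumes an FP16/BF16 cache
) -> float:
    """Estimate KV cache size in GiB.

    Each layer stores one key and one value vector of size
    n_kv_heads * head_dim for every token in the context.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1024**3


# Assumed, illustrative attention shape (verify against config.json).
for ctx in (8192, 32768):
    size = kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128, context_tokens=ctx)
    print(f"{ctx} tokens -> {size:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, so quadrupling the context quadruples the memory it needs on top of the weights.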
Recommended hardware:
- RTX 4090 24GB: fits at Q4 with limited headroom; keep the context length modest
- RTX 5090 32GB — comfortable at Q4, room for large context
- Mac M4 Max 36GB — fits at Q5, excellent for long documents
- Mac M4 Max 64GB — Q6+ runs smoothly with full context
Quick start:
ollama run qwen3:32b
Check compatibility: Qwen 3 32B on RTX 5090
Qwen 3 235B-A22B — Frontier-Class Performance
The flagship MoE model. With 235B total parameters and 22B active per token, this is the most capable model in the original Qwen 3 lineup and competes with frontier proprietary models on many benchmarks.
Recommended hardware:
- H100 80GB × 2 GPUs — fits at Q4, excellent throughput
- A100 80GB × 2-4 GPUs
- MI300X 192GB — fits at Q4 on a single GPU
- Mac Studio with 192GB unified memory (for example M2 Ultra): fits at Q4 with memory to spare
Quick start:
ollama run qwen3:235b-a22b
Newer Qwen generations — dedicated guides
Qwen 3.5 (late 2025 refresh) adds new sizes (2B, 9B, 27B dense; 35B-A3B, 122B-A10B, 397B-A17B MoE) with improved tuning. Qwen 3.6 (April 2026) introduces the 1M-token native context and a flagship-class dense 27B. Because the VRAM math and hardware picks differ per variant, each has its own page:
- Qwen 3.5 Complete Guide
- Qwen 3.5 9B VRAM
- Qwen 3.5 27B VRAM
- Qwen 3.5 35B-A3B VRAM
- Qwen 3.5 122B-A10B VRAM
- Qwen 3.6 27B VRAM & Hardware Requirements — the dense coding flagship
- Qwen 3.6 VRAM & Hardware Requirements (35B-A3B MoE)
- Qwen 3.6 35B-A3B Release Date
Understanding Qwen 3 MoE Variants
Mixture of Experts (MoE) is a key architectural innovation in Qwen 3. In a standard dense model, every parameter is used for every token. In a MoE model, the network is divided into "expert" subnetworks, and only a small fraction are activated per token.
Qwen 3 30B-A3B in practice:
- Total parameters: 30B (must fit in VRAM)
- Active parameters per token: 3B (determines inference speed)
- Result: You load ~17 GB into VRAM, but each token only computes through about 3B parameters, so generation is much faster than a dense 30B model
This is why the 30B-A3B can outperform a dense 14B model while running at comparable speeds. The routing mechanism selects the most relevant experts for each token, concentrating compute where it matters.
Qwen 3 235B-A22B goes further: 235B total in memory, 22B active per token (comparable to a mid-size dense model) — frontier-level quality at workstation inference speeds.
The trade-off: MoE models use more total VRAM than their active parameter count suggests. You pay for capacity in memory, and you get quality + speed in return. Our VRAM calculator accounts for MoE architecture when estimating fit.
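The memory-versus-compute split described above comes down to one ratio per model: the share of weights touched for each token. A small sketch using the parameter counts from this page (the class and field names are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # billions of parameters that must sit in memory
    active_params_b: float  # billions of parameters used for each token

    @property
    def active_fraction(self) -> float:
        """Share of the weights that take part in computing one token."""
        return self.active_params_b / self.total_params_b


MODELS = [
    MoEModel("Qwen 3 30B-A3B", total_params_b=30, active_params_b=3),
    MoEModel("Qwen 3 235B-A22B", total_params_b=235, active_params_b=22),
]

for model in MODELS:
    print(f"{model.name}: {model.active_fraction:.1%} of weights active per token")
```

Both models activate roughly a tenth of their weights per token, which is why VRAM planning should use the total count and speed expectations the active count.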
Qwen 3 for Coding
The Qwen 3 Coder variants are fine-tuned on massive programming datasets. Performance differences versus the base models are most pronounced on:
- Code generation: Larger, more complex functions from natural language
- Bug finding and fixing: Static analysis-style reasoning over code
- Repo-level tasks: Multi-file context and refactoring
- Technical documentation: Accurate docstrings and API descriptions
Qwen 3 Coder 8B
Excellent for everyday coding assistance on constrained hardware. Uses the same VRAM as Qwen 3 8B (~4.6 GB at Q4). A solid choice for developers on RTX 4060 8GB or similar.
ollama run qwen3-coder:8b
Qwen 3 Coder 14B
The best coding model under 16 GB VRAM. Noticeably stronger than the 8B Coder on longer functions and multi-file reasoning.
ollama run qwen3-coder:14b
Qwen 3 Coder 30B-A3B
The most capable coding model for single-GPU setups. The MoE architecture gives it quality close to the 32B dense model while fitting in ~17 GB. If you have a 24GB GPU and write code for a living, this is the model to run.
ollama run qwen3-coder:30b-a3b
Check compatibility: Qwen 3 Coder 30B-A3B on RTX 4090
Choosing the Right Quantization
Unlike reasoning-heavy models such as DeepSeek R1, Qwen 3's base and Coder variants handle quantization gracefully. You can drop to Q4 without dramatic quality loss for most tasks.
General guidance:
| VRAM Budget | Recommended Quant | Notes |
|---|---|---|
| Very tight (≤4 GB) | Q4_K_M | Functional, minimal headroom |
| Normal (4–12 GB) | Q5_K_M | Good quality-size balance |
| Comfortable (12–24 GB) | Q6_K | Near-lossless for most tasks |
| Generous (24 GB+) | Q8_0 | Effectively identical to F16 |
For coding tasks, we recommend Q5_K_M or higher — code generation benefits from precision, especially for syntax-sensitive outputs. For casual chat and summarization, Q4_K_M is fine.
Read our quantization guide for a deeper look at how different quant levels affect output quality.
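The guidance table and the coding recommendation above can be encoded as a small lookup. This is only a sketch of the rule of thumb: the function name is hypothetical, and the handling of the exact tier boundaries (12 GB and 24 GB) is an assumption, since the table's ranges overlap at those points.

```python
def recommended_quant(vram_gb: float, coding: bool = False) -> str:
    """Map a VRAM budget to the quant level suggested in the guidance table.

    Assumption: 12 GB counts as "Normal" and 24 GB counts as "Generous".
    """
    if vram_gb <= 4:
        quant = "Q4_K_M"
    elif vram_gb <= 12:
        quant = "Q5_K_M"
    elif vram_gb < 24:
        quant = "Q6_K"
    else:
        quant = "Q8_0"
    # The text recommends Q5_K_M or higher for coding tasks.
    if coding and quant == "Q4_K_M":
        quant = "Q5_K_M"
    return quant


print(recommended_quant(8))                # Normal tier
print(recommended_quant(4, coding=True))   # bumped up for coding
```

The result is a quant level to aim for, not a fit guarantee: the chosen model's size at that level still has to be checked against the VRAM table.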
Qwen 3 vs Other Leading Open Models
How does Qwen 3 stack up against the competition on hardware requirements?
| Model | Params | VRAM (Q4) | Active Params | Architecture |
|---|---|---|---|---|
| Qwen 3 8B | 8B | 4.6 GB | 8B | Dense |
| Llama 4 Scout | 109B | ~59 GB | 17B | MoE |
| DeepSeek V3 | 671B | ~376 GB | 37B | MoE |
| Qwen 3 30B-A3B | 30B | 16.8 GB | 3B | MoE |
| QwQ 32B | 32B | 18 GB | 32B | Dense |
| Qwen 3 235B-A22B | 235B | 132 GB | 22B | MoE |
The 30B-A3B is especially compelling: it sits in a hardware tier similar to QwQ 32B but computes through only 3B parameters per token thanks to MoE activation sparsity, so it generates far faster than a dense model of that size.
Performance Expectations
Inference speed depends on your hardware's memory bandwidth. Approximate token generation speeds with Q4_K_M:
| Hardware | Qwen 3 8B | Qwen 3 14B | Qwen 3 30B-A3B |
|---|---|---|---|
| RTX 4060 8GB | ~50 tok/s | — | — |
| RTX 4070 12GB | ~60 tok/s | ~35 tok/s | — |
| RTX 4090 24GB | ~85 tok/s | ~55 tok/s | ~70 tok/s* |
| RTX 5090 32GB | ~110 tok/s | ~70 tok/s | ~90 tok/s* |
| Mac M4 Pro 24GB | ~38 tok/s | ~22 tok/s | ~35 tok/s* |
| Mac M4 Max 64GB | ~45 tok/s | ~28 tok/s | ~42 tok/s* |
* The 30B-A3B activates only 3B parameters per token, which gives it a speed advantage over dense models of equivalent total size.
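The link between memory bandwidth and generation speed can be sketched as a rough ceiling: if producing each token requires streaming the active weights from memory once, tokens per second cannot exceed bandwidth divided by the bytes read per token. This is a simplification that ignores KV cache reads, compute time, and runtime overhead. The 1000 GB/s figure is a hypothetical round number, so substitute your hardware's published bandwidth.

```python
def decode_ceiling_tok_s(
    bandwidth_gb_s: float,
    weights_gb: float,
    active_fraction: float = 1.0,
) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    active_fraction is 1.0 for dense models; for MoE models it is the share
    of weights read per token (an idealized assumption).
    """
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return bandwidth_gb_s / (weights_gb * active_fraction)


BANDWIDTH_GB_S = 1000.0  # hypothetical round figure, not a specific GPU

dense_8b = decode_ceiling_tok_s(BANDWIDTH_GB_S, weights_gb=4.6)
moe_30b = decode_ceiling_tok_s(BANDWIDTH_GB_S, weights_gb=16.8, active_fraction=3 / 30)

print(f"Qwen 3 8B ceiling:      {dense_8b:.0f} tok/s")
print(f"Qwen 3 30B-A3B ceiling: {moe_30b:.0f} tok/s")
```

Measured speeds such as those in the table above sit well below this ceiling, and MoE models in particular lose part of their theoretical advantage to routing and less regular memory access.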
Getting Started
- Find your fit: Use the VRAM calculator to see which Qwen 3 variant matches your hardware
- Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
- Run your chosen model:
ollama run qwen3:8b # Small and fast
ollama run qwen3:30b-a3b # Best balance (MoE)
ollama run qwen3-coder:30b-a3b # Best coding on 24GB GPU
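Once a model has been pulled, it can also be called from code through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is listening on its default address (localhost, port 11434); the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("qwen3:8b", "Explain quantization in one sentence."))` should print the model's reply. The first call may be slow while the model loads into VRAM.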