Qwen3.5-35B-A3B VRAM Requirements 2026 — 21.4 GB at Q4
Qwen3.5-35B-A3B needs about 21.4 GB of VRAM at Q4_K_M, which fits a single 24 GB RTX 3090 or RTX 4090, or a Mac M4 Max with 36 GB. This page lists the Q4/Q5/Q6/Q8/FP16 memory numbers, GGUF download sizes, tok/s figures, and hardware fit across RTX 3090, RTX 4080, RTX 4090, RTX 5090, and Apple Silicon.
Can Qwen 3.5 35B-A3B run on my GPU?
| GPU | VRAM | Qwen 3.5 35B-A3B fit |
|---|---|---|
| RTX 4060 Ti 16GB / RTX 5060 Ti 16GB | 16 GB | ❌ Does not fit; use Qwen 3.5 9B instead (27B is ~16.5 GB at Q4, also over 16 GB) |
| RTX 3090 / 3090 Ti | 24 GB | ✅ Q4_K_M fits tightly (~2.5 GB headroom) |
| RTX 4090 | 24 GB | ✅ Q4_K_M sweet spot |
| RTX 5090 | 32 GB | ✅ Q5_K_M comfortable, Q6_K tight |
| RTX 4080 Super | 16 GB | ❌ Does not fit |
| A100 80GB / H100 80GB | 80 GB | ✅ Q8_0 with long context |
| Mac M4 Pro 24GB | 24 GB unified | ⚠️ Q4_K_M marginal |
| Mac M4 Max 36GB+ | 36-128 GB unified | ✅ Q5-Q6 comfortable; Q8_0 needs 64 GB+ |
Quick answers
- Q4_K_M: ~21.4 GB — fits on a 24 GB RTX 4090
- Q5_K_M: ~25.2 GB — needs RTX 5090 32GB, or M4 Max 36GB+
- Q8_0: ~37.5 GB — needs dual-24 GB GPUs or Apple Silicon 64 GB+
- FP16: ~71.8 GB — H100 80GB, A100 80GB, or Mac Studio 96GB+
- Active parameters: 3B per token (that's the "A3B" in the name)
- Speed expectation: 50-80 tok/s on a 24 GB consumer GPU at Q4
- Best deploy profile: llama.cpp GGUF, MLX on Apple Silicon, or Ollama for easy CLI setup
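The quick answers reduce to one comparison: weights plus runtime overhead against available VRAM. A minimal sketch of that check follows, using the weight sizes listed on this page. The `best_quant` helper and its 1.5 GB default overhead (the midpoint of the 1-2 GB this page suggests at default context) are illustrative assumptions, not measured values.

```python
# Weight footprints in GB, copied from the figures on this page.
WEIGHTS_GB = {
    "Q4_K_M": 21.4,
    "Q5_K_M": 25.2,
    "Q6_K": 28.7,
    "Q8_0": 37.5,
    "FP16": 71.8,
}


def best_quant(vram_gb, overhead_gb=1.5):
    """Return the largest quantization whose weights plus overhead fit, or None.

    overhead_gb is an assumed allowance for KV cache and runtime buffers.
    """
    fitting = [
        (size, name)
        for name, size in WEIGHTS_GB.items()
        if size + overhead_gb <= vram_gb
    ]
    # Tuples sort by size first, so max() picks the biggest quant that fits.
    return max(fitting)[1] if fitting else None


if __name__ == "__main__":
    for vram in (16, 24, 32, 48, 80):
        print(f"{vram} GB -> {best_quant(vram)}")
```

Raise `overhead_gb` if you plan to run long contexts; the answer for a given card can drop by one quantization level once the KV cache grows.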
Qwen 3.5 35B-A3B specifications
Qwen 3.5 35B-A3B is the mid-tier Mixture of Experts variant in Alibaba's Qwen 3.5 lineup. It is designed to be the single-GPU-friendly MoE: the 3B active parameter budget keeps inference fast enough for interactive chat, while the 35B total parameter count provides knowledge capacity comparable to much larger dense models.
| Spec | Value |
|---|---|
| Total parameters | 35 billion |
| Active parameters per token | 3 billion |
| Architecture | Mixture of Experts (MoE) |
| Context window | 262,144 tokens (native) |
| Training data cutoff | 2026 (refresh of Qwen 3 30B-A3B) |
| Provider | Alibaba Cloud |
| License | Open weights (Apache 2.0 commercial-friendly) |
| GGUF providers | Unsloth, LM Studio Community, bartowski |
| MLX provider | mlx-community |
VRAM by quantization
These numbers are calibrated against the actual GGUF file sizes published on Hugging Face. Add 1-2 GB for KV cache and runtime overhead at default context length, or 5-10 GB if you push near the 262K context limit.
| Quantization | VRAM (weights) | Real fit on 24 GB GPU | Real fit on M4 Max 36 GB |
|---|---|---|---|
| Q4_K_M | 21.4 GB | ✅ ~2.5 GB headroom | ✅ comfortable |
| Q5_K_M | 25.2 GB | ❌ overflows | ✅ ~11 GB headroom |
| Q6_K | 28.7 GB | ❌ | ✅ ~7 GB headroom |
| Q8_0 | 37.5 GB | ❌ | ❌ (needs 64 GB+) |
| FP16 | 71.8 GB | ❌ | ❌ |
Unsloth Dynamic 4-bit variants (UD-Q4_K_XL) trim another 1-2 GB by quantizing non-critical tensors more aggressively, bringing the footprint closer to 19-20 GB while preserving near-Q4_K_M quality on benchmarks.
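One way to sanity-check these sizes is to convert them into effective bits per weight. The sketch below assumes exactly 35 billion parameters and 1 GB = 10^9 bytes. Both are simplifications, so treat the results as rough.

```python
TOTAL_PARAMS = 35e9  # assumption: exactly 35 billion parameters


def bits_per_weight(size_gb, params=TOTAL_PARAMS):
    """Effective bits stored per parameter for a file of size_gb (1 GB = 1e9 bytes)."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    # Sizes in GB as listed on this page; 19.7 GB is the UD-Q4_K_XL download size.
    sizes = {
        "UD-Q4_K_XL": 19.7,
        "Q4_K_M": 21.4,
        "Q5_K_M": 25.2,
        "Q6_K": 28.7,
        "Q8_0": 37.5,
        "FP16": 71.8,
    }
    for name, size in sizes.items():
        print(f"{name:>10}: {bits_per_weight(size):.2f} bits/weight")
```

Under these assumptions Q4_K_M works out to a little under 5 bits per weight rather than 4, because K-quants keep some tensors at higher precision and store per-block scales. FP16 comes out slightly above 16, which suggests the true parameter count is a bit over 35 billion or that the listed sizes include non-weight data.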
Hardware compatibility matrix
24 GB GPUs — tight fit, excellent speed
On a 24 GB card, Q4_K_M leaves about 2.5 GB of headroom. That is enough for a 4K-8K context window in practice. Larger contexts (32K+) will need partial offloading or a larger GPU.
| GPU | Fit at Q4 | Speed | Notes |
|---|---|---|---|
| RTX 4090 24GB | ✅ | ~70-85 tok/s | Best consumer option |
| RTX 3090 24GB | ✅ | ~55-70 tok/s | Used-market value pick |
| RTX 3090 Ti 24GB | ✅ | ~60-75 tok/s | ~10% faster than 3090 |
| RX 7900 XTX 24GB | ✅ | ~55-70 tok/s | ROCm + Vulkan support |
| L4 24GB | ✅ | ~40-55 tok/s | Low TDP, datacenter profile |
| A10 24GB | ✅ | ~45-60 tok/s | Cloud-friendly |
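The headroom reasoning for 24 GB cards can be sketched as a small budget check. It uses the overhead ranges quoted in the VRAM section (1-2 GB at default context, 5-10 GB near the 262K limit). The tier names and the use of the lower bounds as cutoffs are assumptions made for illustration.

```python
def headroom_gb(vram_gb, weights_gb):
    """VRAM left after loading the weights."""
    return vram_gb - weights_gb


def context_tier(vram_gb, weights_gb):
    """Classify what the spare VRAM is likely to allow.

    Cutoffs (assumed): 1 GB minimum for default context,
    5-10 GB for contexts approaching 262K tokens.
    """
    spare = headroom_gb(vram_gb, weights_gb)
    if spare < 1.0:
        return "does not fit"
    if spare < 5.0:
        return "default context only"
    if spare < 10.0:
        return "long context, may be tight near 262K"
    return "long context"


if __name__ == "__main__":
    print(context_tier(24, 21.4))  # Q4_K_M on a 24 GB card
    print(context_tier(32, 21.4))  # Q4_K_M on a 32 GB card
```

On a 24 GB card this lands in the default-context tier, which matches the 4K-8K guidance above.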
32 GB GPUs — comfortable, Q5 capable
| GPU | Q4 fit | Q5 fit | Speed at Q4 |
|---|---|---|---|
| RTX 5090 32GB | ✅ | ✅ | ~120-170 tok/s |
| Radeon AI PRO R9700 32GB | ✅ | ✅ | ~100-130 tok/s |
48 GB+ GPUs — any quantization
| GPU | Q4 | Q5 | Q6 | Q8 | Speed at Q6 |
|---|---|---|---|---|---|
| A6000 48GB | ✅ | ✅ | ✅ | ✅ (~10 GB headroom) | ~80-100 tok/s |
| RTX 6000 Ada 48GB | ✅ | ✅ | ✅ | ✅ (~10 GB headroom) | ~110-140 tok/s |
| RTX PRO 6000 Blackwell 96GB | ✅ | ✅ | ✅ | ✅ | ~200+ tok/s |
Multi-GPU configurations
Qwen 3.5 35B-A3B benefits less from multi-GPU than dense models because only 3B parameters are active per forward pass. That said, two 24 GB cards let you run Q8 (~37.5 GB) by splitting the weights across both GPUs, either layer by layer or with tensor parallelism; NVLink helps but is not required.
Example: 2× RTX PRO 6000 Max-Q users report ~2,600 tok/s aggregate on Qwen 3.5 35B-A3B Q4 with vLLM tensor parallelism, turning a single MoE model into a serving engine that comfortably handles dozens of concurrent requests.
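The two-card Q8 case is simple arithmetic. The sketch below assumes the weights divide evenly across GPUs and ignores any buffers a runtime duplicates on each card, so real headroom will be somewhat lower.

```python
def per_gpu_share(weights_gb, gpu_vram_gb, n_gpus):
    """Return (weights per GPU, headroom per GPU), assuming an even split."""
    share = weights_gb / n_gpus
    return share, gpu_vram_gb - share


if __name__ == "__main__":
    share, spare = per_gpu_share(37.5, 24, 2)  # Q8_0 on two 24 GB cards
    print(f"{share:.2f} GB of weights per card, {spare:.2f} GB spare per card")
```

With Q8_0 on two 24 GB cards, each card holds 18.75 GB of weights and keeps 5.25 GB spare for KV cache and activations.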
Apple Silicon guide
MoE models fit Apple Silicon beautifully. The 3B active parameters mean the memory bandwidth bottleneck (the usual Apple Silicon ceiling) matters less — you are shoveling fewer bytes per token through the GPU.
| Mac | RAM | Q4 fit | Q5 fit | Q6 fit | Speed at Q4 |
|---|---|---|---|---|---|
| M4 16GB | 16 GB | ❌ | ❌ | ❌ | N/A |
| M4 Pro 24GB | 24 GB | ⚠️ marginal | ❌ | ❌ | ~15-20 tok/s |
| M4 Max 36GB | 36 GB | ✅ | ✅ | ✅ | ~30-40 tok/s |
| M4 Max 64GB | 64 GB | ✅ | ✅ | ✅ | ~50-65 tok/s |
| M4 Max 128GB | 128 GB | ✅ | ✅ | ✅ | ~55-70 tok/s |
| M3 Ultra 512GB | 512 GB | ✅ | ✅ | ✅ | ~80+ tok/s at MLX 8-bit |
Why MoE scales better on Mac: Apple Silicon's unified memory means the full 35B must fit in RAM, but bandwidth to those weights matters less when only 3B are activated per token. Reported community numbers: M3 Ultra 512GB at MLX 8-bit reaches 80.6 tok/s for 35B-A3B while using 39.3 GB — roughly equivalent to an RTX 4090.
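The bandwidth argument can be made concrete with a back-of-envelope ceiling: if decoding is limited by reading weights from memory, tokens per second cannot exceed bandwidth divided by the bytes read per token. The sketch below assumes about 4.9 bits per weight (derived from 21.4 GB over 35 billion parameters) and that only the active parameters are read. The 1000 GB/s bandwidth is a hypothetical round number, not a spec for any device on this page.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_params=3e9, bits_per_weight=4.9):
    """Upper bound on decode speed if weight reads were the only cost.

    bandwidth_gb_s: memory bandwidth in GB/s (1 GB = 1e9 bytes).
    active_params: parameters read per token (3B for this MoE model).
    bits_per_weight: assumed average storage cost per parameter.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    bw = 1000  # hypothetical bandwidth in GB/s
    moe = decode_ceiling_tok_s(bw)                       # 3B active parameters
    dense = decode_ceiling_tok_s(bw, active_params=35e9)  # if all 35B were read
    print(f"MoE ceiling:   {moe:.0f} tok/s")
    print(f"Dense ceiling: {dense:.0f} tok/s")
    print(f"Ratio:         {moe / dense:.1f}x")
```

The ratio between the two ceilings is simply 35/3, or roughly 12x. Reported speeds sit well below the MoE ceiling because KV cache reads, expert routing, and kernel overhead also cost time, so use this as an intuition for why MoE helps, not as a performance prediction.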
MLX vs GGUF — which on Mac
| Framework | Pros | Cons | Best for |
|---|---|---|---|
| MLX (mlx-lm, LM Studio) | Native to Apple Silicon, minimal overhead, best throughput | Limited to Mac, newer ecosystem | Mac-only users who want maximum performance |
| GGUF (llama.cpp, Ollama) | Cross-platform, huge model library, stable | Slightly higher memory overhead on Mac | Cross-device workflows, CLI-first setups |
Recommended for Mac users: grab mlx-community/Qwen3.5-35B-A3B-MLX-4bit via LM Studio or mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit --prompt "...".
For a full comparison of MLX against Ollama on Apple Silicon, see the Qwen 3 & 3.5 Apple Silicon guide.
Setup commands
Ollama (easiest)
```bash
# Install Ollama (if needed)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Qwen 3.5 35B-A3B at Q4_K_M
ollama run qwen3.5:35b-a3b
```
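Once the model is pulled, you can also query it through Ollama's local HTTP API and measure tok/s on your own hardware. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port, 11434, with the model tag from the command above. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "qwen3.5:35b-a3b"  # tag used in the ollama run command above


def build_request(prompt, model=MODEL):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(reply):
    """Decode speed from the reply's eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] * 1e9 / reply["eval_duration"]


def generate(prompt, model=MODEL, timeout=300):
    """Send the prompt and return (text, tok/s). Requires a running Ollama server."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        reply = json.loads(resp.read())
    return reply["response"], tokens_per_second(reply)


if __name__ == "__main__":
    text, speed = generate("Explain MoE routing in one paragraph.")
    print(text)
    print(f"{speed:.1f} tok/s")
```

Comparing the printed speed with the tables on this page is a quick way to confirm the model is fully resident in VRAM; a much lower number usually means part of it spilled to system RAM.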
llama.cpp (Unsloth Dynamic 4-bit)
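```bash
# Clone and build with CMake (llama.cpp no longer builds with plain make)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# -DGGML_CUDA=ON is for NVIDIA GPUs; omit it on Apple Silicon, where Metal is on by default
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Download UD-Q4_K_XL (~19.7 GB)
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --local-dir models/
# Run in conversation mode, offloading all layers to the GPU
./build/bin/llama-cli -m models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -n 256 --color -cnv -p "You are a helpful assistant."
```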
MLX on Mac
```bash
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit \
  --prompt "Explain MoE routing in one paragraph."
```
LM Studio (GUI)
In LM Studio's Discover tab, search "Qwen3.5-35B-A3B" and pick either the GGUF (Q4_K_M) or MLX (4-bit) build depending on your platform.
How Qwen 3.5 35B-A3B compares
vs Qwen 3.5 27B dense
| Metric | 35B-A3B (MoE) | 27B dense |
|---|---|---|
| VRAM Q4 | 21.4 GB | 16.5 GB |
| Active params | 3B | 27B |
| Speed on RTX 4090 | ~70-85 tok/s | ~30-40 tok/s |
| General knowledge | ✅✅ | ✅ |
| Complex reasoning | ✅ | ✅✅ |
| Coding | ✅✅ | ✅✅ |
For interactive chat, 35B-A3B wins on speed. For long-context reasoning (32K+), 27B dense is more predictable.
vs Qwen 3 30B-A3B (previous gen)
| Metric | Qwen 3.5 35B-A3B | Qwen 3 30B-A3B |
|---|---|---|
| VRAM Q4 | 21.4 GB | 16.8 GB |
| Context | 262K native | 131K native |
| Quality | ~8% higher on MMLU | baseline |
The 35B-A3B costs ~5 GB more VRAM but offers longer context and measurably better quality. If you are on a 16 GB GPU that runs the older 30B-A3B (for example at a smaller quantization or with partial offload) but not the new one, consider waiting until you have a 24 GB card. The Unsloth Dynamic 4-bit build of 35B-A3B squeezes closer to 20 GB, which still exceeds 16 GB but leaves more headroom on a 24 GB card.