How much VRAM does Qwen 3.5 35B-A3B need?

Qwen 3.5 35B-A3B needs ~21.4 GB at Q4_K_M, ~25.2 GB at Q5_K_M, ~28.7 GB at Q6_K, and ~37.5 GB at Q8_0. Full precision FP16 requires ~71.8 GB. The 35B weights must all reside in VRAM even though only 3B are active per token — MoE saves compute, not memory.

Can I run Qwen 3.5 35B-A3B on RTX 4090?

Yes. The RTX 4090 (24 GB) runs Qwen 3.5 35B-A3B at Q4_K_M with ~2.5 GB of headroom for KV cache and context. Expect roughly 70-85 tokens/second, because only 3B parameters are active per token despite the 35B total footprint.

Does Qwen 3.5 35B-A3B fit on Mac M4 Max?

Yes, comfortably. An M4 Max with 36 GB unified memory runs Qwen 3.5 35B-A3B at Q4_K_M (~21.4 GB) with 15 GB headroom for macOS, apps, and context. The M4 Max 64 GB handles Q6 with room to spare. Use MLX for best throughput on Apple Silicon.

What is A3B in Qwen 3.5 35B-A3B?

A3B means 3 billion Active parameters per token. Qwen 3.5 35B-A3B is a Mixture of Experts (MoE) model: 35 billion total parameters with only 3 billion activated per forward pass via expert routing. You get 27B-class quality with 3B-class inference speed — but the full 35B weights must load in memory.
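To make the "active versus total" distinction concrete, here is a minimal toy sketch of top-k expert routing. It is not Qwen's actual router: the expert count, experts per token, and parameters per expert are hypothetical values picked only so the ratio is easy to see, and applying softmax over the selected logits is one common convention that real models may not follow.

```python
import math
import random

def top_k_route(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# HYPOTHETICAL layer shape, chosen only for illustration.
NUM_EXPERTS = 32
EXPERTS_PER_TOKEN = 2
PARAMS_PER_EXPERT = 1_000_000

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for router scores
chosen = top_k_route(logits, EXPERTS_PER_TOKEN)

resident = NUM_EXPERTS * PARAMS_PER_EXPERT      # every expert must sit in memory
active = EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT  # only these are multiplied for this token
print(f"experts used for this token: {[i for i, _ in chosen]}")
print(f"resident params: {resident:,}  active params: {active:,}")
```

The next token may pick different experts, which is why none of them can be left out of memory even though each token touches only a few.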

Qwen 3.5 35B-A3B vs Qwen 3.5 27B — which is better?

They are different trade-offs. 35B-A3B (MoE, 21.4 GB at Q4) offers faster inference and slightly better general quality. 27B dense (16.5 GB at Q4) is more predictable in latency, stronger at complex reasoning, and uses less VRAM. For chat and agentic workloads, 35B-A3B wins. For deep reasoning or code that fits in context, 27B dense is often preferable.

What is the best quantization for Qwen 3.5 35B-A3B?

Q4_K_M is the sweet spot for most users: minimal quality loss and fits on a single 24 GB GPU. If you have 32 GB VRAM or more, step up to Q5_K_M or Q6_K for near-lossless output. For agentic or coding workloads, prefer Q5_K_M or higher because MoE models can be slightly more sensitive to aggressive quantization at expert-routing layers.

April 20, 2026 (updated April 22, 2026)
Tags: qwen, qwen-3-5, qwen3.5-35b-a3b, 35b-a3b, moe, vram, gpu-requirements, gguf, apple-silicon

Qwen3.5-35B-A3B VRAM Requirements 2026 — 21.4 GB at Q4

Qwen3.5-35B-A3B needs ~21.4 GB at Q4_K_M. Fits RTX 4090/3090 and Mac M4 Max. Exact Q4/Q5/Q6/Q8 GGUF numbers, tok/s benchmarks, and GPU recommendations.

Qwen3.5-35B-A3B VRAM requirements (2026): this page has the exact Q4/Q5/Q6/Q8/FP16 memory numbers, GGUF download sizes, and hardware fit across RTX 3090, RTX 4080, RTX 4090, RTX 5090, and Apple Silicon. Quick answer: ~21.4 GB at Q4_K_M — fits a single RTX 4090 or Mac M4 Max 36GB. Jump to your GPU tier below.

Can Qwen 3.5 35B-A3B run on my GPU?

GPU	VRAM	Qwen 3.5 35B-A3B fit
RTX 4060 Ti 16GB / RTX 5060 Ti 16GB	16 GB	❌ Does not fit; use Qwen 3.5 9B instead (27B at Q4 is ~16.5 GB and also overflows)
RTX 4080 Super	16 GB	❌ Does not fit
RTX 3090 / 3090 Ti	24 GB	✅ Q4_K_M fits tightly (~2.5 GB headroom)
RTX 4090	24 GB	✅ Q4_K_M sweet spot
RTX 5090	32 GB	✅ Q5_K_M comfortable, Q6_K tight
A100 80GB / H100 80GB	80 GB	✅ Q8_0 with long context
Mac M4 Pro 24GB	24 GB unified	⚠️ Q4_K_M marginal
Mac M4 Max 36GB+	36-128 GB	✅ Q4-Q6 on 36 GB; Q8_0 needs 64 GB+

Quick answers

Q4_K_M: ~21.4 GB — fits on a 24 GB RTX 4090
Q5_K_M: ~25.2 GB — needs RTX 5090 32GB, or M4 Max 36GB+
Q8_0: ~37.5 GB — needs dual-24 GB GPUs or Apple Silicon 64 GB+
FP16: ~71.8 GB — H100 80GB, A100 80GB, or Mac Studio 96GB+
Active parameters: 3B per token (that's the "A3B" in the name)
Speed expectation: roughly 55-85 tok/s on a 24 GB consumer GPU at Q4, depending on the card
Best deploy profile: llama.cpp GGUF, MLX on Apple Silicon, or Ollama for easy CLI setup

Qwen 3.5 35B-A3B specifications

Qwen 3.5 35B-A3B is the mid-tier Mixture of Experts variant in Alibaba's Qwen 3.5 lineup. It is designed to be the single-GPU-friendly MoE: the 3B active parameter budget keeps inference fast enough for interactive chat, while the 35B total parameter count provides knowledge capacity comparable to much larger dense models.

Spec	Value
Total parameters	35 billion
Active parameters per token	3 billion
Architecture	Mixture of Experts (MoE)
Context window	262,144 tokens (native)
Training data cutoff	2026 (refresh of Qwen 3 30B-A3B)
Provider	Alibaba Cloud
License	Open weights (Apache 2.0 commercial-friendly)
GGUF providers	Unsloth, LM Studio Community, bartowski
MLX provider	mlx-community

VRAM by quantization

These numbers are calibrated against the actual GGUF file sizes published on Hugging Face. Add 1-2 GB for KV cache and runtime overhead at default context length, or 5-10 GB if you push near the 262K context limit.

Quantization	VRAM (weights)	Real fit on 24 GB GPU	Real fit on M4 Max 36 GB
Q4_K_M	21.4 GB	✅ ~2.5 GB headroom	✅ comfortable
Q5_K_M	25.2 GB	❌ overflows	✅ ~11 GB headroom
Q6_K	28.7 GB	❌	✅ ~7 GB headroom
Q8_0	37.5 GB	❌	❌ (needs 64 GB+)
FP16	71.8 GB	❌	❌

Unsloth Dynamic 4-bit variants (UD-Q4_K_XL) trim another 1-2 GB by quantizing non-critical tensors more aggressively, bringing the footprint closer to 19-20 GB while preserving near-Q4_K_M quality on benchmarks.
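The fit checks in the table can be reproduced with a few lines of arithmetic. The sketch below uses the weight sizes from this page and assumes a flat 2 GB allowance for KV cache and runtime overhead (the upper end of the 1-2 GB default-context figure above). The function names are illustrative, not part of any tool.

```python
# Weight footprints in GB, copied from the table above.
WEIGHTS_GB = {
    "Q4_K_M": 21.4,
    "Q5_K_M": 25.2,
    "Q6_K": 28.7,
    "Q8_0": 37.5,
    "FP16": 71.8,
}

def headroom_gb(memory_gb: float, quant: str) -> float:
    """Memory left after loading the weights (negative means overflow)."""
    return round(memory_gb - WEIGHTS_GB[quant], 1)

def largest_fitting_quant(memory_gb: float, overhead_gb: float = 2.0):
    """Return the largest quantization whose weights plus overhead fit, or None."""
    fitting = [q for q, gb in WEIGHTS_GB.items() if gb + overhead_gb <= memory_gb]
    # Among the candidates that fit, the biggest footprint is the highest precision.
    return max(fitting, key=WEIGHTS_GB.get) if fitting else None

for mem in (16, 24, 32, 48, 80):
    print(f"{mem} GB -> {largest_fitting_quant(mem)}")
```

On Apple Silicon the same memory also holds macOS and running apps, so pass a larger `overhead_gb` there than you would for a dedicated GPU.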

Hardware compatibility matrix

24 GB GPUs — tight fit, excellent speed

On a 24 GB card, Q4_K_M leaves about 2.5 GB of headroom. That is enough for a 4K-8K context window in practice. Larger contexts (32K+) will need partial offloading or a larger GPU.

GPU	Fit at Q4	Speed	Notes
RTX 4090 24GB	✅	~70-85 tok/s	Best consumer option
RTX 3090 24GB	✅	~55-70 tok/s	Used-market value pick
RTX 3090 Ti 24GB	✅	~60-75 tok/s	~10% faster than 3090
RX 7900 XTX 24GB	✅	~55-70 tok/s	ROCm + Vulkan support
L4 24GB	✅	~40-55 tok/s	Low TDP, datacenter profile
A10 24GB	✅	~45-60 tok/s	Cloud-friendly

32 GB GPUs — comfortable, Q5 capable

GPU	Q4 fit	Q5 fit	Speed at Q4
RTX 5090 32GB	✅	✅	~120-170 tok/s
Radeon AI PRO R9700 32GB	✅	✅	~100-130 tok/s

48 GB+ GPUs — any quantization

GPU	Q4	Q5	Q6	Q8	Speed at Q6
A6000 48GB	✅	✅	✅	✅	~80-100 tok/s
RTX 6000 Ada 48GB	✅	✅	✅	✅	~110-140 tok/s
RTX PRO 6000 Blackwell 96GB	✅	✅	✅	✅	~200+ tok/s

Multi-GPU configurations

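Qwen 3.5 35B-A3B benefits less from multi-GPU than dense models because only 3B parameters are active per forward pass. That said, two 24 GB cards let you run Q8 (~37.5 GB) by splitting the model across them, either layer by layer or with tensor parallelism. NVLink helps on cards that support it but is not required.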

Example: 2× RTX PRO 6000 Max-Q users report ~2,600 tok/s aggregate on Qwen 3.5 35B-A3B Q4 with vLLM tensor parallelism, turning a single MoE model into a serving engine that comfortably handles dozens of concurrent requests.
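For the two-card Q8 case mentioned above, a quick way to sanity-check a split is to divide the weights in proportion to each card's memory and see what is left. This is a rough sketch with an illustrative helper name. Real runtimes split on layer or tensor boundaries, so actual per-card usage will differ somewhat.

```python
def split_across_gpus(weights_gb, gpu_memory_gb):
    """Split weights in proportion to each card's memory.

    Returns one (weights_share_gb, headroom_gb) tuple per card.
    """
    total = sum(gpu_memory_gb)
    plan = []
    for mem in gpu_memory_gb:
        share = weights_gb * mem / total
        plan.append((round(share, 2), round(mem - share, 2)))
    return plan

# Q8_0 (~37.5 GB) across two 24 GB cards.
for share, free in split_across_gpus(37.5, [24, 24]):
    print(f"weights {share} GB, headroom {free} GB")
```

The same proportions can be passed to llama.cpp through its `--tensor-split` option when the cards are not identical.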

Apple Silicon guide

MoE models fit Apple Silicon beautifully. The 3B active parameters mean the memory bandwidth bottleneck (the usual Apple Silicon ceiling) matters less — you are shoveling fewer bytes per token through the GPU.

Mac	RAM	Q4 fit	Q5 fit	Q6 fit	Speed at Q4
M4 16GB	16 GB	❌	❌	❌	N/A
M4 Pro 24GB	24 GB	⚠️ marginal	❌	❌	~15-20 tok/s
M4 Max 36GB	36 GB	✅	✅	✅	~30-40 tok/s
M4 Max 64GB	64 GB	✅	✅	✅	~50-65 tok/s
M4 Max 128GB	128 GB	✅	✅	✅	~55-70 tok/s
M3 Ultra 512GB	512 GB	✅	✅	✅	~80+ tok/s at MLX 8-bit

Why MoE scales better on Mac: Apple Silicon's unified memory means the full 35B must fit in RAM, but bandwidth to those weights matters less when only 3B are activated per token. Reported community numbers: M3 Ultra 512GB at MLX 8-bit reaches 80.6 tok/s for 35B-A3B while using 39.3 GB — roughly equivalent to an RTX 4090.
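The bandwidth argument can be put into rough numbers. The sketch below computes a theoretical ceiling on decode speed by assuming each token reads only the active share of the weights once, and that the active share of bytes equals the active share of parameters (3 of 35). The bandwidth figures are vendor spec-sheet peaks, not measurements, and the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb, active_params_b, total_params_b):
    """Upper bound on decode speed if each token reads only the active weights once."""
    gb_read_per_token = weights_gb * active_params_b / total_params_b
    return bandwidth_gb_s / gb_read_per_token

# Published peak memory bandwidth in GB/s (spec-sheet values, assumed here).
BANDWIDTH = {"RTX 4090": 1008.0, "M4 Max (546 GB/s variant)": 546.0, "M3 Ultra": 819.0}

for name, bw in BANDWIDTH.items():
    moe = decode_ceiling_tok_s(bw, 21.4, 3, 35)     # 35B-A3B at Q4_K_M
    dense = decode_ceiling_tok_s(bw, 16.5, 27, 27)  # 27B dense at Q4
    print(f"{name}: MoE ceiling {moe:.0f} tok/s, dense ceiling {dense:.0f} tok/s")
```

Measured speeds sit far below these ceilings because the model ignores KV-cache reads, router and kernel overhead, and compute limits. The useful takeaway is the ratio: the MoE model moves roughly a ninth of the bytes per token that the dense 27B does.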

MLX vs GGUF — which on Mac

Framework	Pros	Cons	Best for
MLX (mlx-lm, LM Studio)	Native to Apple Silicon, minimal overhead, best throughput	Limited to Mac, newer ecosystem	Mac-only users who want maximum performance
GGUF (llama.cpp, Ollama)	Cross-platform, huge model library, stable	Slightly higher memory overhead on Mac	Cross-device workflows, CLI-first setups

Recommended for Mac users: grab mlx-community/Qwen3.5-35B-A3B-MLX-4bit via LM Studio, or run it from the command line with mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit --prompt "...".

For a full comparison of MLX against Ollama on Apple Silicon, see the Qwen 3 & 3.5 Apple Silicon guide.

Setup commands

Ollama (easiest)

# Install Ollama (if needed)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Qwen 3.5 35B-A3B at Q4_K_M
ollama run qwen3.5:35b-a3b
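Once the model is running, you can measure your own tok/s instead of relying on the ranges quoted on this page. The sketch below calls Ollama's local REST endpoint and derives decode speed from the `eval_count` and `eval_duration` fields of the reply. It assumes the server is on the default port and that the model tag matches the one pulled above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def tokens_per_second(response: dict) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def generate(prompt: str, model: str = "qwen3.5:35b-a3b") -> dict:
    """Send one non-streaming request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)

# Usage (requires a running Ollama server with the model already pulled):
#   result = generate("Explain MoE routing in one paragraph.")
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Run it a few times and discard the first result, since the initial request includes model load time in its total duration.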

llama.cpp (Unsloth Dynamic 4-bit)

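# Clone and build (llama.cpp builds with CMake; the old Makefile build was removed)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# -DGGML_CUDA=ON targets NVIDIA GPUs; omit it on Mac (Metal is on by default) or for CPU-only builds
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Download UD-Q4_K_XL (~19.7 GB)
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --local-dir models/

# Run (binaries land in build/bin; -ngl 99 offloads all layers to the GPU)
./build/bin/llama-cli -m models/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 -n 256 --color -cnv -p "You are a helpful assistant."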

MLX on Mac

pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit \
  --prompt "Explain MoE routing in one paragraph."

LM Studio (GUI)

In LM Studio's Discover tab, search "Qwen3.5-35B-A3B" and pick either the GGUF (Q4_K_M) or MLX (4-bit) build depending on your platform.

How Qwen 3.5 35B-A3B compares

vs Qwen 3.5 27B dense

Metric	35B-A3B (MoE)	27B dense
VRAM Q4	21.4 GB	16.5 GB
Active params	3B	27B
Speed on RTX 4090	~70-85 tok/s	~30-40 tok/s
General knowledge	✅✅	✅
Complex reasoning	✅	✅✅
Coding	✅✅	✅✅

For interactive chat, 35B-A3B wins on speed. For long-context reasoning (32K+), 27B dense is more predictable.

vs Qwen 3 30B-A3B (previous gen)

Metric	Qwen 3.5 35B-A3B	Qwen 3 30B-A3B
VRAM Q4	21.4 GB	16.8 GB
Context	262K native	131K native
Quality	~+8% on MMLU	baseline

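The 35B-A3B costs ~5 GB more VRAM but offers longer context and measurably better quality. By the numbers above, neither model fits entirely in 16 GB at Q4, so a 16 GB setup that runs the older 30B-A3B is already relying on a smaller quant or partial offload. For 35B-A3B, plan on a 24 GB card. The Unsloth Dynamic 4-bit build squeezes closer to 20 GB, which frees extra room for context on 24 GB but still does not fit in 16 GB.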
