How much VRAM does Qwen 3.5 122B-A10B need?

Qwen 3.5 122B-A10B needs ~74.4 GB at Q4_K_M, ~87.8 GB at Q5_K_M, ~100 GB at Q6_K, and ~130.5 GB at Q8_0. Full FP16 requires ~250 GB, which means multi-GPU or a 512 GB M3 Ultra. These figures cover weights only; add 5-10 GB for KV cache at long contexts.

What hardware can run Qwen 3.5 122B-A10B?

On workstation GPUs: A100 80GB (fits Q4), H100 80GB (faster Q4), MI300X 192GB (fits Q5 with headroom). On Apple Silicon: Mac Studio M4 Max 128GB (fits Q4 comfortably), Mac Studio M4 Ultra 192GB (fits Q5), M3 Ultra 512GB (runs everything including Q8). Consumer GPUs are out — you need pro-tier single GPUs or multi-GPU setups.

Qwen 3.5 122B-A10B vs 35B-A3B — when is the bigger MoE worth it?

Pick 122B-A10B when your workflow benefits from larger knowledge capacity: long-document analysis, complex agentic reasoning, or specialized domains. The 10B active parameters deliver noticeably stronger reasoning than 35B-A3B's 3B active, at the cost of 3.5× more memory. For interactive chat and coding, 35B-A3B is usually the better choice.

Can Mac Studio run Qwen 3.5 122B-A10B?

Yes, and it is one of the best platforms for this model. Mac Studio M4 Max 128GB runs Q4_K_M (~74.4 GB) comfortably with room for context. Mac Studio M4 Ultra 192GB handles Q5 (~87.8 GB). Mac Studio M3 Ultra 512GB can run Q8 (~130.5 GB) with massive context headroom. Use MLX for best throughput on Apple Silicon.

What is A10B in Qwen 3.5 122B-A10B?

A10B means 10 billion Active parameters per token. It is a Mixture of Experts (MoE) model with 122B total parameters but only 10B activated per forward pass. Quality approaches that of dense 70B+ models while inference runs at roughly the speed of a 10B dense model. All 122B weights must reside in memory — MoE saves compute, not memory.

Multi-GPU setup for Qwen 3.5 122B-A10B — what is needed?

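Two RTX 4090 24GB cards (48 GB total) cannot hold the ~74.4 GB Q4 weights, and two A100 40GB cards (80 GB total) leave only ~5 GB for KV cache. A practical dual-GPU minimum is two 48 GB cards such as the RTX 6000 Ada (96 GB total), split with tensor parallelism in vLLM or row-split mode in llama.cpp. The split adds ~15-20% overhead vs a single 80GB card. For serving many concurrent users, a single H100 80GB or MI300X 192GB beats a dual-GPU setup thanks to higher memory bandwidth and no inter-GPU communication.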

April 20, 2026

Tags: qwen, qwen-3-5, 122b-a10b, moe, vram, workstation, apple-silicon, h100

Qwen 3.5 122B-A10B VRAM Requirements — MoE Workstation & Mac Studio Guide (Q4, Q5, Q6, Q8)

Qwen 3.5 122B-A10B needs ~74.4 GB at Q4_K_M. Fits on A100 80GB, H100, Mac Studio M4 Max 128GB, or M3 Ultra. Full VRAM table, benchmarks, and multi-GPU guidance.

Qwen 3.5 122B-A10B is the professional-tier MoE in the Qwen 3.5 family: frontier-class reasoning with a Q4 footprint that fits on a single H100 or Mac Studio. This guide lists approximate VRAM requirements per quantization plus realistic hardware recommendations.

Quick answers

Q4_K_M: ~74.4 GB (fits on A100 80GB, H100 80GB, Mac Studio 128GB+)
Q5_K_M: ~87.8 GB (needs H200 141GB, MI300X 192GB, or Mac Studio Ultra 192GB+)
Q6_K: ~100 GB (needs H200 141GB, MI300X 192GB, multi-GPU with 2× 80GB, or Mac Studio Ultra 192GB+)
Q8_0: ~130.5 GB (needs MI300X 192GB or Mac Studio M3 Ultra 256GB+)
FP16: ~250 GB (multi-GPU with 4× 80GB, or M3 Ultra 512GB)
Active parameters: 10B per token
Speed on H100 80GB: ~55-75 tok/s at Q4, ~80-100 tok/s on MI300X at Q5
Speed on Mac Studio: ~25-45 tok/s on M4 Ultra 192GB at Q5 via MLX

Qwen 3.5 122B-A10B specifications

Spec	Value
Total parameters	122 billion
Active parameters per token	10 billion
Architecture	Mixture of Experts (MoE)
Context window	262,144 tokens (native)
Provider	Alibaba Cloud
License	Open weights (Apache 2.0)
Experts	128 total, 8 active per token
GGUF providers	Unsloth, bartowski, LM Studio Community
MLX provider	mlx-community
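
To make the total versus active distinction concrete, here is a minimal Python sketch that uses only the two parameter counts from the table above. It assumes the nominal counts are exact and uses 16 bits per weight purely as an illustration; `weight_gb` is a hypothetical helper, not part of any library.

```python
TOTAL_PARAMS = 122e9   # total parameters (nominal, from the spec table)
ACTIVE_PARAMS = 10e9   # parameters activated per token (nominal)


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Size of `params` weights in GB (10^9 bytes) at a given precision."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
resident_gb = weight_gb(TOTAL_PARAMS, 16)    # every expert must be loaded
per_token_gb = weight_gb(ACTIVE_PARAMS, 16)  # weights actually read per token

print(f"active fraction: {active_fraction:.1%}")
print(f"resident at 16-bit: {resident_gb:.0f} GB")
print(f"read per token at 16-bit: {per_token_gb:.0f} GB")
```

The nominal 244 GB comes out slightly under the ~250 GB FP16 figure quoted in this guide, presumably because 122B is a rounded parameter count.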

VRAM by quantization

Quantization	VRAM (weights)	Single 80 GB GPU	Single 192 GB GPU	Mac Studio 128 GB	Mac Studio M3 Ultra 512 GB
Q4_K_M	74.4 GB	✅ ~5 GB headroom	✅ comfortable	✅	✅
Q5_K_M	87.8 GB	❌ overflows	✅ ~100 GB headroom	⚠️ marginal	✅
Q6_K	100.0 GB	❌	✅ ~90 GB headroom	❌	✅
Q8_0	130.5 GB	❌	✅ ~60 GB headroom	❌	✅
FP16	250.1 GB	❌	❌	❌	✅ (~260 GB free)

Unsloth Dynamic 4-bit (UD-Q4_K_XL) brings the 4-bit footprint down to ~70 GB, which leaves ~10 GB on a single 80 GB GPU for KV cache instead of ~5 GB.
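
As a sanity check on the table above, the following sketch back-computes the bits per weight each row implies. It assumes GB means 10^9 bytes and that the model has exactly 122 billion parameters, both of which are simplifications; `implied_bits_per_weight` is a hypothetical helper name.

```python
NOMINAL_PARAMS = 122e9  # assumption: the nominal total parameter count

# Weight sizes in GB, copied from the VRAM table in this guide.
TABLE_GB = {
    "Q4_K_M": 74.4,
    "Q5_K_M": 87.8,
    "Q6_K": 100.0,
    "Q8_0": 130.5,
    "FP16": 250.1,
}


def implied_bits_per_weight(size_gb: float, params: float = NOMINAL_PARAMS) -> float:
    """Average bits stored per parameter for a file of `size_gb` GB."""
    return size_gb * 1e9 * 8 / params


for name, gb in TABLE_GB.items():
    print(f"{name:7s} {implied_bits_per_weight(gb):5.2f} bits/weight")
```

Under these assumptions Q4_K_M works out to a little under 5 bits per weight rather than exactly 4, which is why a "4-bit" file is larger than 122B × 0.5 bytes = 61 GB.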

Hardware compatibility matrix

Data-center single GPUs

GPU	Q4	Q5	Q6	Q8	Speed at Q4
A100 80GB	✅	❌	❌	❌	~40-55 tok/s
H100 80GB	✅	❌	❌	❌	~55-75 tok/s
H200 141GB	✅	✅	✅	⚠️ tight	~70-95 tok/s
MI300X 192GB	✅	✅	✅	✅	~80-110 tok/s
B200 192GB	✅	✅	✅	✅	~130-180 tok/s
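
The fit verdicts for 80 GB and 192 GB cards can be reproduced with simple subtraction. The sketch below is one way to do it, assuming a 5 GB reserve for KV cache and runtime buffers (the low end of the 5-10 GB allowance this guide mentions); the reserve value and the helper names are assumptions, not measured requirements.

```python
# Weight sizes in GB, copied from the VRAM table in this guide.
WEIGHTS_GB = {
    "Q4_K_M": 74.4,
    "Q5_K_M": 87.8,
    "Q6_K": 100.0,
    "Q8_0": 130.5,
    "FP16": 250.1,
}


def headroom_gb(device_gb: float, quant: str) -> float:
    """Memory left after loading the weights (negative means overflow)."""
    return device_gb - WEIGHTS_GB[quant]


def fits(device_gb: float, quant: str, reserve_gb: float = 5.0) -> bool:
    """True if the weights plus an assumed reserve fit on one device."""
    return headroom_gb(device_gb, quant) >= reserve_gb


for device_gb in (80, 192):
    for quant in WEIGHTS_GB:
        verdict = "fits" if fits(device_gb, quant) else "does not fit"
        print(f"{device_gb:>3} GB  {quant:7s} {verdict:13s} "
              f"(headroom {headroom_gb(device_gb, quant):7.1f} GB)")
```

Raising `reserve_gb` toward 10 GB models longer contexts; at that setting Q4_K_M no longer passes on an 80 GB card, which matches the guide's note that headroom there is thin.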

Multi-GPU configurations

MoE models parallelize well across GPUs because expert routing can be distributed.

Setup	Total VRAM	Fit	Tooling
2× A100 80GB (tensor parallel)	160 GB	Q4-Q6	vLLM with `-tp 2`
2× H100 80GB NVLink	160 GB	Q4-Q6	vLLM, TensorRT-LLM
4× A100 80GB	320 GB	Q4-Q8	vLLM with `-tp 4`
2× RTX 4090 24GB	48 GB	❌ (insufficient)	Not viable for 122B
2× RTX 6000 Ada 48GB	96 GB	Q4 only	llama.cpp row-split
2× RTX PRO 6000 96GB	192 GB	Q4-Q8	vLLM or llama.cpp

For most self-hosted deployments, a single H100 80GB at Q4 is the cleanest setup. Multi-GPU adds ~15% overhead per inter-GPU hop.
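
A rough way to check the dual-GPU rows above is to split the weights evenly and add a per-GPU reserve. The sketch below assumes a perfectly even split and a 5 GB reserve per card; real frameworks split unevenly and need different buffer sizes, so treat `fits_split` (a hypothetical helper) as a first-pass filter only.

```python
def per_gpu_weights_gb(weights_gb: float, num_gpus: int) -> float:
    """Weights held by each GPU, assuming a perfectly even split."""
    return weights_gb / num_gpus


def fits_split(weights_gb: float, num_gpus: int, gpu_gb: float,
               reserve_gb: float = 5.0) -> bool:
    """True if each GPU can hold its share plus an assumed reserve."""
    return per_gpu_weights_gb(weights_gb, num_gpus) + reserve_gb <= gpu_gb


Q4_GB, Q5_GB = 74.4, 87.8  # from the VRAM table in this guide

print("2x 24 GB, Q4:", fits_split(Q4_GB, 2, 24))
print("2x 48 GB, Q4:", fits_split(Q4_GB, 2, 48))
print("2x 48 GB, Q5:", fits_split(Q5_GB, 2, 48))
```

With these assumptions two 24 GB cards fail at Q4, while two 48 GB cards pass at Q4 and fail at Q5, in line with the table.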

Apple Silicon — Mac Studio territory

Mac	RAM	Q4 fit	Q5 fit	Q6 fit	Q8 fit	Speed at Q4
M4 Max 64GB	64 GB	❌	❌	❌	❌	N/A
M4 Max 128GB	128 GB	✅ ~50 GB headroom	⚠️ marginal	❌	❌	~20-30 tok/s
M4 Ultra 192GB	192 GB	✅	✅ ~100 GB headroom	⚠️ marginal	❌	~30-45 tok/s
M3 Ultra 256GB	256 GB	✅	✅	✅	✅ tight	~40-55 tok/s
M3 Ultra 512GB	512 GB	✅	✅	✅	✅	~50-65 tok/s

Mac Studio M3 Ultra 512GB is the only single-device consumer-reachable platform that runs FP16 full precision. At ~250 GB it still leaves 250+ GB for macOS, context, and concurrent workloads.

For the MLX vs Ollama throughput comparison on large Mac Studios, see MLX vs Ollama on Apple Silicon.

Setup commands

vLLM (production serving on H100/A100)

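# Note: 0.92 × 80 GB = 73.6 GB usable, which is below the ~74.4 GB Q4_K_M weights.
# On one 80 GB GPU, use a smaller quant and a shorter --max-model-len;
# on two GPUs, add --tensor-parallel-size 2.
vllm serve unsloth/Qwen3.5-122B-A10B-GGUF \
  --quantization gguf \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92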

llama.cpp (single or dual GPU)

# Download UD-Q4_K_XL (~70 GB)
huggingface-cli download unsloth/Qwen3.5-122B-A10B-GGUF \
  Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf --local-dir models/

# Single 80 GB GPU
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
  -n 512 -ngl 99 --color -cnv

# Dual 48 GB GPU split
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
  -n 512 -ngl 99 --split-mode row --color -cnv

Ollama

ollama run qwen3.5:122b-a10b

MLX on Mac Studio

pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Qwen3.5-122B-A10B-MLX-4bit \
  --prompt "Summarize the following 100K-word document..." \
  --max-tokens 4096

Qwen 3.5 122B-A10B vs alternatives

vs Qwen 3.5 35B-A3B

Metric	122B-A10B	35B-A3B
VRAM Q4	74.4 GB	21.4 GB
Active params	10B	3B
Speed on H100 80GB	~55-75 tok/s	~140-180 tok/s
Speed on M4 Max 64GB	❌ (doesn't fit)	~55-70 tok/s
Hardware tier	Pro / Mac Studio	Consumer (24 GB+)
Best for	Long-doc analysis, deep reasoning	Interactive chat, coding

Pick the 35B-A3B unless you specifically need the extra knowledge capacity — it fits consumer hardware and is 2-3× faster.
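
The memory and speed trade-off can be read straight off the comparison table. This short sketch computes both ratios from the table's Q4 sizes and H100 speed ranges; the dictionary layout is only an illustration.

```python
# Figures copied from the comparison table above.
BIG = {"vram_q4_gb": 74.4, "h100_tok_s": (55, 75)}      # 122B-A10B
SMALL = {"vram_q4_gb": 21.4, "h100_tok_s": (140, 180)}  # 35B-A3B

memory_ratio = BIG["vram_q4_gb"] / SMALL["vram_q4_gb"]

# Compare like with like: low end against low end, high end against high end.
speedup_low = SMALL["h100_tok_s"][0] / BIG["h100_tok_s"][0]
speedup_high = SMALL["h100_tok_s"][1] / BIG["h100_tok_s"][1]

print(f"122B-A10B needs {memory_ratio:.1f}x the memory at Q4")
print(f"35B-A3B is {min(speedup_low, speedup_high):.1f}x to "
      f"{max(speedup_low, speedup_high):.1f}x faster on an H100 80GB")
```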

vs Llama 4 Scout 109B

Metric	Qwen 3.5 122B-A10B	Llama 4 Scout 109B
VRAM Q4	74.4 GB	~61 GB
Active params	10B	17B
Context	262K	10M (extended)
License	Apache 2.0	Llama 4 Community License

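Llama 4 Scout uses less VRAM but activates 17B parameters per token against Qwen's 10B, so it reads more weights per generated token. Expect comparable or somewhat lower decode speed on the same hardware, in exchange for the smaller footprint and the longer context.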
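
To compare per-token work, one can scale each model's Q4 file size by its active share of parameters. The sketch below does that under two assumptions: bits per weight are uniform across the file, and decode speed is dominated by reading active weights. Neither holds exactly, so the result is indicative only; `active_read_gb` is a hypothetical helper.

```python
def active_read_gb(q4_size_gb: float, total_params_b: float,
                   active_params_b: float) -> float:
    """Approximate GB of weights read per generated token at Q4."""
    return q4_size_gb * active_params_b / total_params_b


# Sizes and parameter counts (in billions) from the comparison table above.
qwen_gb = active_read_gb(74.4, total_params_b=122, active_params_b=10)
scout_gb = active_read_gb(61.0, total_params_b=109, active_params_b=17)

print(f"Qwen 3.5 122B-A10B: ~{qwen_gb:.1f} GB read per token")
print(f"Llama 4 Scout 109B: ~{scout_gb:.1f} GB read per token")
print(f"ratio: {scout_gb / qwen_gb:.2f}x")
```

Under these assumptions Scout touches roughly one and a half times as many weight bytes per token, so measured throughput is the only reliable way to compare the two on a given machine.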

vs DeepSeek R1 Distill 70B

Metric	Qwen 3.5 122B-A10B	DeepSeek R1 Distill 70B
VRAM Q4	74.4 GB	~40 GB
Architecture	MoE (10B active)	Dense
Reasoning quality	Strong	Specialized for reasoning
Multilingual	✅✅ (100+ languages)	✅

Pick DeepSeek R1 70B for pure reasoning tasks. Pick Qwen 3.5 122B-A10B for broader multilingual knowledge + MoE speed.

Real-world use cases

Long-document summarization: 262K context + 10B active params = entire books in one pass
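Multi-agent coordination: one resident model can serve several agent roles, and 10B active parameters keep per-token compute low for each. Expert routing is learned per token, so experts cannot be assigned to specific agent roles.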
Domain-specific chatbots: larger knowledge capacity reduces hallucination in niche verticals
Research assistant on a Mac Studio: M4 Ultra 192GB at MLX 4-bit gives local frontier-class reasoning without data leaving the machine

Check compatibility

Qwen 3.5 122B-A10B model page — full spec + all hardware verdicts
Qwen 3.5 122B-A10B on H100 80GB
Qwen 3.5 122B-A10B on Mac Studio M4 Max 128GB
Qwen 3.5 122B-A10B on Mac Studio M3 Ultra 512GB