Will It Run AI
qwen, qwen-3-5, 122b-a10b, moe, vram, workstation, apple-silicon, h100

Qwen 3.5 122B-A10B VRAM Requirements — MoE Workstation & Mac Studio Guide (Q4, Q5, Q6, Q8)

Qwen 3.5 122B-A10B needs ~74.4 GB at Q4_K_M. Fits on A100 80GB, H100, Mac Studio M4 Max 128GB, or M3 Ultra. Full VRAM table, benchmarks, and multi-GPU guidance.

Qwen 3.5 122B-A10B is the professional-tier MoE in the Qwen 3.5 family — frontier-class reasoning at a memory footprint that fits on a single H100 or Mac Studio. This guide has exact VRAM requirements plus realistic hardware recommendations.

Quick answers

  • Q4_K_M: ~74.4 GB — fits on A100 80GB, H100 80GB, Mac Studio 128GB+
  • Q5_K_M: ~87.8 GB — needs MI300X 192GB, Mac Studio Ultra 192GB+
  • Q6_K: ~100 GB — needs multi-GPU (2× 80GB) or Mac Studio Ultra 192GB+
  • Q8_0: ~130.5 GB — needs MI300X 192GB, Mac Studio M3 Ultra 256GB+
  • FP16: ~250 GB — multi-GPU only (4× 80GB) or M3 Ultra 512GB
  • Active parameters: 10B per token
  • Speed on H100 80GB: ~55-75 tok/s at Q4, ~80-100 tok/s on MI300X at Q5
  • Speed on Mac Studio: ~25-45 tok/s on M4 Ultra 192GB at Q5 via MLX

Qwen 3.5 122B-A10B specifications

SpecValue
Total parameters122 billion
Active parameters per token10 billion
ArchitectureMixture of Experts (MoE)
Context window262,144 tokens (native)
ProviderAlibaba Cloud
LicenseOpen weights (Apache 2.0)
Experts128 total, 8 active per token
GGUF providersUnsloth, bartowski, LM Studio Community
MLX providermlx-community

VRAM by quantization

QuantizationVRAM (weights)Single 80 GB GPUSingle 192 GB GPUMac Studio 128 GBMac Studio M3 Ultra 512 GB
Q4_K_M74.4 GB✅ ~5 GB headroom✅ comfortable
Q5_K_M87.8 GB❌ overflows✅ ~100 GB headroom⚠️ marginal
Q6_K100.0 GB✅ ~90 GB headroom
Q8_0130.5 GB✅ ~60 GB headroom
FP16250.1 GB✅ (~260 GB free)

Unsloth Dynamic 4-bit (UD-Q4_K_XL) brings this down to ~70 GB, making it viable on a single 80 GB GPU with generous context headroom.

Hardware compatibility matrix

Data-center single GPUs

GPUQ4Q5Q6Q8Speed at Q4
A100 80GB~40-55 tok/s
H100 80GB~55-75 tok/s
H200 141GB~70-95 tok/s
MI300X 192GB✅ tight~80-110 tok/s
B200 192GB✅ tight~130-180 tok/s

Multi-GPU configurations

MoE models parallelize well across GPUs because expert routing can be distributed.

SetupTotal VRAMFitTooling
2× A100 80GB (tensor parallel)160 GBQ4, Q5vLLM with -tp 2
2× H100 80GB NVLink160 GBQ4, Q5vLLM, TensorRT-LLM
4× A100 80GB320 GBQ4-Q8vLLM with -tp 4
2× RTX 4090 24GB48 GB❌ (insufficient)Not viable for 122B
2× RTX 6000 Ada 48GB96 GBQ4 onlyllama.cpp row-split
2× RTX PRO 6000 96GB192 GBQ4-Q8vLLM or llama.cpp

For most self-hosted deployments, a single H100 80GB at Q4 is the cleanest setup. Multi-GPU adds ~15% overhead per inter-GPU hop.

Apple Silicon — Mac Studio territory

MacRAMQ4 fitQ5 fitQ6 fitQ8 fitSpeed at Q4
M4 Max 64GB64 GBN/A
M4 Max 128GB128 GB✅ ~50 GB headroom❌ (tight)~20-30 tok/s
M4 Ultra 192GB192 GB✅ ~100 GB headroom⚠️ marginal~30-45 tok/s
M3 Ultra 256GB256 GB✅ tight~40-55 tok/s
M3 Ultra 512GB512 GB~50-65 tok/s

Mac Studio M3 Ultra 512GB is the only single-device consumer-reachable platform that runs FP16 full precision. At ~250 GB it still leaves 250+ GB for macOS, context, and concurrent workloads.

For the MLX vs Ollama throughput comparison on large Mac Studios, see MLX vs Ollama on Apple Silicon.

Setup commands

vLLM (production serving on H100/A100)

vllm serve unsloth/Qwen3.5-122B-A10B-GGUF \
  --quantization gguf \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92

llama.cpp (single or dual GPU)

# Download UD-Q4_K_XL (~70 GB)
huggingface-cli download unsloth/Qwen3.5-122B-A10B-GGUF \
  Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf --local-dir models/

# Single 80 GB GPU
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
  -n 512 -ngl 99 --color -cnv

# Dual 48 GB GPU split
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
  -n 512 -ngl 99 --split-mode row --color -cnv

Ollama

ollama run qwen3.5:122b-a10b

MLX on Mac Studio

pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Qwen3.5-122B-A10B-MLX-4bit \
  --prompt "Summarize the following 100K-word document..." \
  --max-tokens 4096

Qwen 3.5 122B-A10B vs alternatives

vs Qwen 3.5 35B-A3B

Metric122B-A10B35B-A3B
VRAM Q474.4 GB21.4 GB
Active params10B3B
Speed on H100 80GB~55-75 tok/s~140-180 tok/s
Speed on M4 Max 64GB❌ (doesn't fit)~55-70 tok/s
Hardware tierPro / Mac StudioConsumer (24 GB+)
Best forLong-doc analysis, deep reasoningInteractive chat, coding

Pick the 35B-A3B unless you specifically need the extra knowledge capacity — it fits consumer hardware and is 2-3× faster.

vs Llama 4 Scout 109B

MetricQwen 3.5 122B-A10BLlama 4 Scout 109B
VRAM Q474.4 GB~61 GB
Active params10B17B
Context262K10M (extended)
LicenseApache 2.0Llama 4 Community License

Llama 4 Scout uses less VRAM but activates more parameters — similar speed, different trade-offs.

vs DeepSeek R1 Distill 70B

MetricQwen 3.5 122B-A10BDeepSeek R1 Distill 70B
VRAM Q474.4 GB~40 GB
ArchitectureMoE (10B active)Dense
Reasoning qualityStrongSpecialized for reasoning
Multilingual✅✅ (100+ languages)

Pick DeepSeek R1 70B for pure reasoning tasks. Pick Qwen 3.5 122B-A10B for broader multilingual knowledge + MoE speed.

Real-world use cases

  • Long-document summarization: 262K context + 10B active params = entire books in one pass
  • Multi-agent coordination: MoE routing lets you keep different experts specialized per agent role
  • Domain-specific chatbots: larger knowledge capacity reduces hallucination in niche verticals
  • Research assistant on a Mac Studio: M4 Ultra 192GB at MLX 4-bit gives local frontier-class reasoning without data leaving the machine

Check compatibility

Related guides

Frequently Asked Questions

How much VRAM does Qwen 3.5 122B-A10B need?

Qwen 3.5 122B-A10B needs ~74.4 GB at Q4_K_M, ~87.8 GB at Q5_K_M, ~100 GB at Q6_K, and ~130.5 GB at Q8_0. Full FP16 requires ~250 GB — firmly multi-GPU territory. Add 5-10 GB for KV cache at long contexts.

What hardware can run Qwen 3.5 122B-A10B?

On workstation GPUs: A100 80GB (fits Q4), H100 80GB (faster Q4), MI300X 192GB (fits Q5 with headroom). On Apple Silicon: Mac Studio M4 Max 128GB (fits Q4 comfortably), Mac Studio M4 Ultra 192GB (fits Q5), M3 Ultra 512GB (runs everything including Q8). Consumer GPUs are out — you need pro-tier single GPUs or multi-GPU setups.

Qwen 3.5 122B-A10B vs 35B-A3B — when is the bigger MoE worth it?

Pick 122B-A10B when your workflow benefits from larger knowledge capacity: long-document analysis, complex agentic reasoning, or specialized domains. The 10B active parameters deliver noticeably stronger reasoning than 35B-A3B's 3B active, at the cost of 3.5× more memory. For interactive chat and coding, 35B-A3B is usually the better choice.

Can Mac Studio run Qwen 3.5 122B-A10B?

Yes, and it is one of the best platforms for this model. Mac Studio M4 Max 128GB runs Q4_K_M (~74.4 GB) comfortably with room for context. Mac Studio M4 Ultra 192GB handles Q5 (~87.8 GB). Mac Studio M3 Ultra 512GB can run Q8 (~130.5 GB) with massive context headroom. Use MLX for best throughput on Apple Silicon.

What is A10B in Qwen 3.5 122B-A10B?

A10B means 10 billion Active parameters per token. It is a Mixture of Experts (MoE) model with 122B total parameters but only 10B activated per forward pass. Quality approaches that of dense 70B+ models while inference runs at roughly the speed of a 10B dense model. All 122B weights must reside in memory — MoE saves compute, not memory.

Multi-GPU setup for Qwen 3.5 122B-A10B — what is needed?

Two A100 40GB or two RTX 4090 24GB cards can run Q4 with tensor parallelism via vLLM or llama.cpp row-split mode. The split adds ~15-20% overhead vs a single 80GB card. For serving many concurrent users, a single H100 80GB or MI300X 192GB beats a dual-GPU setup thanks to higher memory bandwidth and no inter-GPU communication.