Qwen 3.5 122B-A10B VRAM Requirements — MoE Workstation & Mac Studio Guide (Q4, Q5, Q6, Q8)
Qwen 3.5 122B-A10B needs ~74.4 GB at Q4_K_M. Fits on A100 80GB, H100, Mac Studio M4 Max 128GB, or M3 Ultra. Full VRAM table, benchmarks, and multi-GPU guidance.
Qwen 3.5 122B-A10B is the professional-tier MoE in the Qwen 3.5 family — frontier-class reasoning at a memory footprint that fits on a single H100 or Mac Studio. This guide has exact VRAM requirements plus realistic hardware recommendations.
Quick answers
- Q4_K_M: ~74.4 GB — fits on A100 80GB, H100 80GB, Mac Studio 128GB+
- Q5_K_M: ~87.8 GB — needs MI300X 192GB, Mac Studio Ultra 192GB+
- Q6_K: ~100 GB — needs multi-GPU (2× 80GB) or Mac Studio Ultra 192GB+
- Q8_0: ~130.5 GB — needs MI300X 192GB, Mac Studio M3 Ultra 256GB+
- FP16: ~250 GB — multi-GPU only (4× 80GB) or M3 Ultra 512GB
- Active parameters: 10B per token
- Speed on H100 80GB: ~55-75 tok/s at Q4, ~80-100 tok/s on MI300X at Q5
- Speed on Mac Studio: ~25-45 tok/s on M4 Ultra 192GB at Q5 via MLX
Qwen 3.5 122B-A10B specifications
| Spec | Value |
|---|---|
| Total parameters | 122 billion |
| Active parameters per token | 10 billion |
| Architecture | Mixture of Experts (MoE) |
| Context window | 262,144 tokens (native) |
| Provider | Alibaba Cloud |
| License | Open weights (Apache 2.0) |
| Experts | 128 total, 8 active per token |
| GGUF providers | Unsloth, bartowski, LM Studio Community |
| MLX provider | mlx-community |
VRAM by quantization
| Quantization | VRAM (weights) | Single 80 GB GPU | Single 192 GB GPU | Mac Studio 128 GB | Mac Studio M3 Ultra 512 GB |
|---|---|---|---|---|---|
| Q4_K_M | 74.4 GB | ✅ ~5 GB headroom | ✅ comfortable | ✅ | ✅ |
| Q5_K_M | 87.8 GB | ❌ overflows | ✅ ~100 GB headroom | ⚠️ marginal | ✅ |
| Q6_K | 100.0 GB | ❌ | ✅ ~90 GB headroom | ❌ | ✅ |
| Q8_0 | 130.5 GB | ❌ | ✅ ~60 GB headroom | ❌ | ✅ |
| FP16 | 250.1 GB | ❌ | ❌ | ❌ | ✅ (~260 GB free) |
Unsloth Dynamic 4-bit (UD-Q4_K_XL) brings this down to ~70 GB, making it viable on a single 80 GB GPU with generous context headroom.
Hardware compatibility matrix
Data-center single GPUs
| GPU | Q4 | Q5 | Q6 | Q8 | Speed at Q4 |
|---|---|---|---|---|---|
| A100 80GB | ✅ | ❌ | ❌ | ❌ | ~40-55 tok/s |
| H100 80GB | ✅ | ❌ | ❌ | ❌ | ~55-75 tok/s |
| H200 141GB | ✅ | ✅ | ✅ | ❌ | ~70-95 tok/s |
| MI300X 192GB | ✅ | ✅ | ✅ | ✅ tight | ~80-110 tok/s |
| B200 192GB | ✅ | ✅ | ✅ | ✅ tight | ~130-180 tok/s |
Multi-GPU configurations
MoE models parallelize well across GPUs because expert routing can be distributed.
| Setup | Total VRAM | Fit | Tooling |
|---|---|---|---|
| 2× A100 80GB (tensor parallel) | 160 GB | Q4, Q5 | vLLM with -tp 2 |
| 2× H100 80GB NVLink | 160 GB | Q4, Q5 | vLLM, TensorRT-LLM |
| 4× A100 80GB | 320 GB | Q4-Q8 | vLLM with -tp 4 |
| 2× RTX 4090 24GB | 48 GB | ❌ (insufficient) | Not viable for 122B |
| 2× RTX 6000 Ada 48GB | 96 GB | Q4 only | llama.cpp row-split |
| 2× RTX PRO 6000 96GB | 192 GB | Q4-Q8 | vLLM or llama.cpp |
For most self-hosted deployments, a single H100 80GB at Q4 is the cleanest setup. Multi-GPU adds ~15% overhead per inter-GPU hop.
Apple Silicon — Mac Studio territory
| Mac | RAM | Q4 fit | Q5 fit | Q6 fit | Q8 fit | Speed at Q4 |
|---|---|---|---|---|---|---|
| M4 Max 64GB | 64 GB | ❌ | ❌ | ❌ | ❌ | N/A |
| M4 Max 128GB | 128 GB | ✅ ~50 GB headroom | ❌ (tight) | ❌ | ❌ | ~20-30 tok/s |
| M4 Ultra 192GB | 192 GB | ✅ | ✅ ~100 GB headroom | ⚠️ marginal | ❌ | ~30-45 tok/s |
| M3 Ultra 256GB | 256 GB | ✅ | ✅ | ✅ | ✅ tight | ~40-55 tok/s |
| M3 Ultra 512GB | 512 GB | ✅ | ✅ | ✅ | ✅ | ~50-65 tok/s |
Mac Studio M3 Ultra 512GB is the only single-device consumer-reachable platform that runs FP16 full precision. At ~250 GB it still leaves 250+ GB for macOS, context, and concurrent workloads.
For the MLX vs Ollama throughput comparison on large Mac Studios, see MLX vs Ollama on Apple Silicon.
Setup commands
vLLM (production serving on H100/A100)
vllm serve unsloth/Qwen3.5-122B-A10B-GGUF \
--quantization gguf \
--max-model-len 131072 \
--gpu-memory-utilization 0.92
llama.cpp (single or dual GPU)
# Download UD-Q4_K_XL (~70 GB)
huggingface-cli download unsloth/Qwen3.5-122B-A10B-GGUF \
Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf --local-dir models/
# Single 80 GB GPU
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
-n 512 -ngl 99 --color -cnv
# Dual 48 GB GPU split
./llama-cli -m models/Qwen3.5-122B-A10B-UD-Q4_K_XL.gguf \
-n 512 -ngl 99 --split-mode row --color -cnv
Ollama
ollama run qwen3.5:122b-a10b
MLX on Mac Studio
pip install mlx-lm
mlx_lm.generate \
--model mlx-community/Qwen3.5-122B-A10B-MLX-4bit \
--prompt "Summarize the following 100K-word document..." \
--max-tokens 4096
Qwen 3.5 122B-A10B vs alternatives
vs Qwen 3.5 35B-A3B
| Metric | 122B-A10B | 35B-A3B |
|---|---|---|
| VRAM Q4 | 74.4 GB | 21.4 GB |
| Active params | 10B | 3B |
| Speed on H100 80GB | ~55-75 tok/s | ~140-180 tok/s |
| Speed on M4 Max 64GB | ❌ (doesn't fit) | ~55-70 tok/s |
| Hardware tier | Pro / Mac Studio | Consumer (24 GB+) |
| Best for | Long-doc analysis, deep reasoning | Interactive chat, coding |
Pick the 35B-A3B unless you specifically need the extra knowledge capacity — it fits consumer hardware and is 2-3× faster.
vs Llama 4 Scout 109B
| Metric | Qwen 3.5 122B-A10B | Llama 4 Scout 109B |
|---|---|---|
| VRAM Q4 | 74.4 GB | ~61 GB |
| Active params | 10B | 17B |
| Context | 262K | 10M (extended) |
| License | Apache 2.0 | Llama 4 Community License |
Llama 4 Scout uses less VRAM but activates more parameters — similar speed, different trade-offs.
vs DeepSeek R1 Distill 70B
| Metric | Qwen 3.5 122B-A10B | DeepSeek R1 Distill 70B |
|---|---|---|
| VRAM Q4 | 74.4 GB | ~40 GB |
| Architecture | MoE (10B active) | Dense |
| Reasoning quality | Strong | Specialized for reasoning |
| Multilingual | ✅✅ (100+ languages) | ✅ |
Pick DeepSeek R1 70B for pure reasoning tasks. Pick Qwen 3.5 122B-A10B for broader multilingual knowledge + MoE speed.
Real-world use cases
- Long-document summarization: 262K context + 10B active params = entire books in one pass
- Multi-agent coordination: MoE routing lets you keep different experts specialized per agent role
- Domain-specific chatbots: larger knowledge capacity reduces hallucination in niche verticals
- Research assistant on a Mac Studio: M4 Ultra 192GB at MLX 4-bit gives local frontier-class reasoning without data leaving the machine
Check compatibility
- Qwen 3.5 122B-A10B model page — full spec + all hardware verdicts
- Qwen 3.5 122B-A10B on H100 80GB
- Qwen 3.5 122B-A10B on Mac Studio M4 Max 128GB
- Qwen 3.5 122B-A10B on Mac Studio M3 Ultra 512GB