This guide covers running Qwen 3.5 specifically on Apple Silicon via MLX, with exact memory numbers per variant and benchmarks from MacBook Air M4 16GB through Mac Studio M3 Ultra 512GB. For the generic MLX-vs-Ollama framework comparison, see MLX vs Ollama on Apple Silicon.
Quick answers — which Qwen 3.5 for which Mac
Memory usage — MLX quantizations per variant
These are peak unified memory usage numbers during inference (model weights + KV cache at default 4K context). Add ~1 GB per 8K additional context.
Qwen 3.5 9B dense
| MLX quant | Peak memory | Quality | Notes |
|---|
| 4-bit | ~6.0 GB | Minor loss | Fits on 16 GB Macs with room for apps |
| 6-bit | ~7.9 GB | Near-lossless | Comfortable on 16 GB+ Macs |
| 8-bit | ~10.4 GB | Effectively FP16 | Needs 24 GB+ for comfortable context |
| FP16 | ~20.3 GB | Reference | Needs 32 GB+ Macs (M4 Max tier) |
Qwen 3.5 27B dense
| MLX quant | Peak memory | Quality | Notes |
|---|
| 4-bit | ~16.8 GB | Minor loss | Tight on 24 GB Macs; comfortable on 36 GB+ |
| 6-bit | ~22.5 GB | Near-lossless | 36 GB+ recommended |
| 8-bit | ~29.2 GB | Effectively FP16 | 48 GB+ (Mac Studio tier) |
| FP16 | ~56.2 GB | Reference | 64 GB+ |
Qwen 3.5 35B-A3B MoE
| MLX quant | Peak memory | Quality | Notes |
|---|
| 4-bit | ~19.5-20.5 GB | Minor loss | Fits M4 Max 36GB comfortably |
| 4-bit (VLM variant) | ~20.5-22.0 GB | Minor loss | Includes vision encoder |
| 6-bit | ~28.9 GB | Near-lossless | M4 Max 36GB comfortable, 64GB preferred |
| 8-bit | ~38.0 GB | Effectively FP16 | 64 GB+ |
| Extreme 2-bit (community) | ~11.3-12.5 GB | Major loss but usable | Fits on M4 16GB (tight) |
Community note: one MLX build ("Qwen3.5-35B-A3B-Alis-MLX-Dynamic-2.6bpw-VLM") uses extreme dynamic quantization (Expert 2-bit + Attention/GDN 3-bit) to fit 35B-A3B into 16 GB. Quality drops meaningfully but inference works at ~5-6 tok/s on M4 16GB.
Qwen 3.5 122B-A10B MoE
| MLX quant | Peak memory | Quality | Notes |
|---|
| 4-bit | ~75 GB | Minor loss | Fits Mac Studio M4 Max 128GB |
| 5-bit | ~89 GB | Near-lossless | Mac Studio Ultra 192GB+ |
| 6-bit | ~102 GB | Near-lossless | Mac Studio Ultra 192GB+ |
| 8-bit | ~132 GB | Effectively FP16 | M3 Ultra 256GB+ |
| FP16 | ~252 GB | Reference | M3 Ultra 512GB |
Tokens/second — per Mac and variant
Numbers triangulated from r/LocalLLaMA community reports, Ollama 0.19 release benchmarks, and Apple Silicon LLM inference guides. Decode-phase throughput at default context.
Qwen 3.5 9B at MLX 4-bit
| Mac | Tok/s | Bandwidth limited? |
|---|
| M1 MacBook Pro 16GB | ~15-20 | Compute-limited |
| M2 16GB | ~18-24 | Compute-limited |
| M3 16GB | ~22-28 | Balanced |
| M4 16GB | ~25-35 | Balanced |
| M4 Pro 24GB | ~32-42 | Bandwidth-limited |
| M4 Max 36GB | ~60-80 | Balanced |
| M4 Max 64GB | ~62-82 | Balanced |
| M3 Ultra 512GB | ~120-160 | Balanced |
Qwen 3.5 27B at MLX 4-bit
| Mac | Tok/s |
|---|
| M4 Pro 24GB | ~15-22 (tight fit) |
| M4 Max 36GB | ~30-42 |
| M4 Max 64GB | ~35-48 |
| M4 Max 128GB | ~38-50 |
| M3 Ultra 512GB | ~65-85 |
Qwen 3.5 35B-A3B at MLX 4-bit
| Mac | Tok/s |
|---|
| M4 Pro 24GB | ~18-25 (tight) |
| M4 Max 36GB | ~40-55 |
| M4 Max 64GB | ~55-70 |
| M4 Max 128GB | ~58-75 |
| M3 Ultra 512GB | ~80-110 |
Qwen 3.5 122B-A10B at MLX 4-bit
| Mac | Tok/s |
|---|
| M4 Max 128GB | ~20-30 |
| M4 Ultra 192GB | ~30-45 |
| M3 Ultra 256GB | ~40-55 |
| M3 Ultra 512GB | ~50-65 |
Setup — mlx-lm (CLI)
# Install
pip install mlx-lm
# Run Qwen 3.5 9B MLX 4-bit
mlx_lm.generate \
--model mlx-community/Qwen3.5-9B-MLX-4bit \
--prompt "Write a concise summary of Mixture of Experts." \
--max-tokens 512
# Run the 35B-A3B MoE
mlx_lm.generate \
--model mlx-community/Qwen3.5-35B-A3B-MLX-4bit \
--prompt "Implement a Python function to batch API requests." \
--max-tokens 1024
# Python API
python -c "
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Qwen3.5-9B-MLX-4bit')
print(generate(model, tokenizer, prompt='Hello!', max_tokens=100))
"
Setup — LM Studio (GUI)
- Install LM Studio from lmstudio.ai.
- Open the Discover tab.
- Search "Qwen 3.5".
- Filter results by "MLX" (top right filter button).
- Download the quantization matching your Mac tier (see table above).
- Load in the Chat tab or expose via Developer tab → Local Server.
LM Studio's OpenAI-compatible server on port 1234 makes it the easiest way to plug MLX Qwen 3.5 into existing agent frameworks (LangChain, CrewAI, LlamaIndex).
Fine-tuning Qwen 3.5 with MLX
MLX natively supports LoRA and QLoRA fine-tuning on Apple Silicon. Example for Qwen 3.5 9B:
# Convert HF model to MLX format (if not using a pre-quantized version)
mlx_lm.convert \
--hf-path Qwen/Qwen3.5-9B \
--mlx-path mlx_models/qwen3.5-9b \
--quantize
# Fine-tune with LoRA
mlx_lm.lora \
--model mlx_models/qwen3.5-9b \
--train \
--data data/ \
--lora-layers 16 \
--batch-size 2 \
--iters 600
Practical: a LoRA fine-tune of Qwen 3.5 9B on a M4 Max 64GB takes ~2 hours for 600 iterations with a small domain dataset. Full fine-tuning is not feasible on consumer Macs — even the 9B base would need 40+ GB for optimizer states.
Troubleshooting
- Metal shader crash on macOS 26 + Ollama 0.18.2: known bug affecting M4/M5 chips. Fix: upgrade to Ollama 0.19 or switch to MLX (unaffected).
- OOM on M4 Pro 24GB running 27B or 35B-A3B: close Chrome, Docker Desktop, IDE, and Spotlight heavy apps. macOS reserves ~3.5 GB. You have ~20.5 GB usable.
- MLX is slower than expected: verify you downloaded the MLX variant, not the GGUF. MLX models are named
mlx-community/Qwen3.5-*-MLX-4bit, not unsloth/Qwen3.5-*-GGUF.
- Context window collapses speed: at 32K+ context the KV cache overhead on a 24 GB Mac forces memory pressure. Drop to 8K-16K for interactive work.
- Download errors on mlx-community models: some have been renamed. Check the mlx-community HF org for the latest naming.
Related guides