Will It Run AI
qwen, qwen-3-5, mlx, apple-silicon, macbook, mac-studio, unified-memory

Qwen 3.5 on Apple Silicon MLX (2026) — Memory Usage, Tokens/sec & Setup Guide

Qwen 3.5 35B-A3B MLX 4-bit uses ~19.5 GB on Mac. 9B fits 16 GB Macs at 25-35 tok/s. Full MLX memory + speed benchmarks per variant from M4 16GB to M3 Ultra 512GB.

This guide covers running Qwen 3.5 specifically on Apple Silicon via MLX, with exact memory numbers per variant and benchmarks from MacBook Air M4 16GB through Mac Studio M3 Ultra 512GB. For the generic MLX-vs-Ollama framework comparison, see MLX vs Ollama on Apple Silicon.

Quick answers — which Qwen 3.5 for which Mac

MacBest Qwen 3.5 variantMLX quantPeak memoryTok/s
M4 16GB MacBook Air9B4-bit~7 GB25-35
M4 Pro 24GB9B Q8 or 27B 4-bit8-bit / 4-bit~10 / ~18 GB30-40 / 18-25
M4 Max 36GB35B-A3B 4-bit4-bit~21 GB40-55
M4 Max 64GB35B-A3B 6-bit6-bit~29 GB50-65
M4 Max 128GB122B-A10B 4-bit4-bit~75 GB20-30
M4 Ultra 192GB122B-A10B 5-bit5-bit~88 GB30-45
M3 Ultra 256GB122B-A10B 8-bit8-bit~131 GB40-55
M3 Ultra 512GB122B-A10B FP16 or 35B-A3B 8-bit8-bit / FP16flexible50-80

Memory usage — MLX quantizations per variant

These are peak unified memory usage numbers during inference (model weights + KV cache at default 4K context). Add ~1 GB per 8K additional context.

Qwen 3.5 9B dense

MLX quantPeak memoryQualityNotes
4-bit~6.0 GBMinor lossFits on 16 GB Macs with room for apps
6-bit~7.9 GBNear-losslessComfortable on 16 GB+ Macs
8-bit~10.4 GBEffectively FP16Needs 24 GB+ for comfortable context
FP16~20.3 GBReferenceNeeds 32 GB+ Macs (M4 Max tier)

Qwen 3.5 27B dense

MLX quantPeak memoryQualityNotes
4-bit~16.8 GBMinor lossTight on 24 GB Macs; comfortable on 36 GB+
6-bit~22.5 GBNear-lossless36 GB+ recommended
8-bit~29.2 GBEffectively FP1648 GB+ (Mac Studio tier)
FP16~56.2 GBReference64 GB+

Qwen 3.5 35B-A3B MoE

MLX quantPeak memoryQualityNotes
4-bit~19.5-20.5 GBMinor lossFits M4 Max 36GB comfortably
4-bit (VLM variant)~20.5-22.0 GBMinor lossIncludes vision encoder
6-bit~28.9 GBNear-losslessM4 Max 36GB comfortable, 64GB preferred
8-bit~38.0 GBEffectively FP1664 GB+
Extreme 2-bit (community)~11.3-12.5 GBMajor loss but usableFits on M4 16GB (tight)

Community note: one MLX build ("Qwen3.5-35B-A3B-Alis-MLX-Dynamic-2.6bpw-VLM") uses extreme dynamic quantization (Expert 2-bit + Attention/GDN 3-bit) to fit 35B-A3B into 16 GB. Quality drops meaningfully but inference works at ~5-6 tok/s on M4 16GB.

Qwen 3.5 122B-A10B MoE

MLX quantPeak memoryQualityNotes
4-bit~75 GBMinor lossFits Mac Studio M4 Max 128GB
5-bit~89 GBNear-losslessMac Studio Ultra 192GB+
6-bit~102 GBNear-losslessMac Studio Ultra 192GB+
8-bit~132 GBEffectively FP16M3 Ultra 256GB+
FP16~252 GBReferenceM3 Ultra 512GB

Tokens/second — per Mac and variant

Numbers triangulated from r/LocalLLaMA community reports, Ollama 0.19 release benchmarks, and Apple Silicon LLM inference guides. Decode-phase throughput at default context.

Qwen 3.5 9B at MLX 4-bit

MacTok/sBandwidth limited?
M1 MacBook Pro 16GB~15-20Compute-limited
M2 16GB~18-24Compute-limited
M3 16GB~22-28Balanced
M4 16GB~25-35Balanced
M4 Pro 24GB~32-42Bandwidth-limited
M4 Max 36GB~60-80Balanced
M4 Max 64GB~62-82Balanced
M3 Ultra 512GB~120-160Balanced

Qwen 3.5 27B at MLX 4-bit

MacTok/s
M4 Pro 24GB~15-22 (tight fit)
M4 Max 36GB~30-42
M4 Max 64GB~35-48
M4 Max 128GB~38-50
M3 Ultra 512GB~65-85

Qwen 3.5 35B-A3B at MLX 4-bit

MacTok/s
M4 Pro 24GB~18-25 (tight)
M4 Max 36GB~40-55
M4 Max 64GB~55-70
M4 Max 128GB~58-75
M3 Ultra 512GB~80-110

Qwen 3.5 122B-A10B at MLX 4-bit

MacTok/s
M4 Max 128GB~20-30
M4 Ultra 192GB~30-45
M3 Ultra 256GB~40-55
M3 Ultra 512GB~50-65

Setup — mlx-lm (CLI)

# Install
pip install mlx-lm

# Run Qwen 3.5 9B MLX 4-bit
mlx_lm.generate \
  --model mlx-community/Qwen3.5-9B-MLX-4bit \
  --prompt "Write a concise summary of Mixture of Experts." \
  --max-tokens 512

# Run the 35B-A3B MoE
mlx_lm.generate \
  --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit \
  --prompt "Implement a Python function to batch API requests." \
  --max-tokens 1024

# Python API
python -c "
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Qwen3.5-9B-MLX-4bit')
print(generate(model, tokenizer, prompt='Hello!', max_tokens=100))
"

Setup — LM Studio (GUI)

  1. Install LM Studio from lmstudio.ai.
  2. Open the Discover tab.
  3. Search "Qwen 3.5".
  4. Filter results by "MLX" (top right filter button).
  5. Download the quantization matching your Mac tier (see table above).
  6. Load in the Chat tab or expose via Developer tab → Local Server.

LM Studio's OpenAI-compatible server on port 1234 makes it the easiest way to plug MLX Qwen 3.5 into existing agent frameworks (LangChain, CrewAI, LlamaIndex).

Fine-tuning Qwen 3.5 with MLX

MLX natively supports LoRA and QLoRA fine-tuning on Apple Silicon. Example for Qwen 3.5 9B:

# Convert HF model to MLX format (if not using a pre-quantized version)
mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-9B \
  --mlx-path mlx_models/qwen3.5-9b \
  --quantize

# Fine-tune with LoRA
mlx_lm.lora \
  --model mlx_models/qwen3.5-9b \
  --train \
  --data data/ \
  --lora-layers 16 \
  --batch-size 2 \
  --iters 600

Practical: a LoRA fine-tune of Qwen 3.5 9B on a M4 Max 64GB takes ~2 hours for 600 iterations with a small domain dataset. Full fine-tuning is not feasible on consumer Macs — even the 9B base would need 40+ GB for optimizer states.

Troubleshooting

  • Metal shader crash on macOS 26 + Ollama 0.18.2: known bug affecting M4/M5 chips. Fix: upgrade to Ollama 0.19 or switch to MLX (unaffected).
  • OOM on M4 Pro 24GB running 27B or 35B-A3B: close Chrome, Docker Desktop, IDE, and Spotlight heavy apps. macOS reserves ~3.5 GB. You have ~20.5 GB usable.
  • MLX is slower than expected: verify you downloaded the MLX variant, not the GGUF. MLX models are named mlx-community/Qwen3.5-*-MLX-4bit, not unsloth/Qwen3.5-*-GGUF.
  • Context window collapses speed: at 32K+ context the KV cache overhead on a 24 GB Mac forces memory pressure. Drop to 8K-16K for interactive work.
  • Download errors on mlx-community models: some have been renamed. Check the mlx-community HF org for the latest naming.

Related guides

Frequently Asked Questions

How much memory does Qwen 3.5 35B-A3B use in MLX?

Qwen 3.5 35B-A3B MLX 4-bit peaks at ~19.5 GB of unified memory during inference (weights 19.5 GB + KV cache ~1-3 GB depending on context). The VLM variant peaks at ~20.5 GB. This fits comfortably on M4 Max 36GB and above. On M4 Pro 24GB it is marginal — close Chrome and other memory-hungry apps before loading.

What is the fastest Mac for Qwen 3.5?

M3 Ultra 512GB is the fastest Mac for Qwen 3.5 — it reaches 80+ tok/s on Qwen 3.5 35B-A3B at MLX 8-bit thanks to 800 GB/s unified memory bandwidth. For consumer Macs, M4 Max 64GB delivers ~55-70 tok/s on the same model. MacBook Air M4 16GB runs the 9B variant at 25-35 tok/s — real interactive speed.

Can MacBook Air M4 16GB run Qwen 3.5?

Yes. MacBook Air M4 16GB runs Qwen 3.5 9B at MLX 4-bit (5.5 GB model + 1-2 GB cache) at 25-35 tokens/second. You will need to close heavy apps (Chrome, Docker) to leave headroom. For Qwen 3.5 27B or the 35B-A3B MoE, you need 24GB+ unified memory.

MLX vs GGUF on Mac for Qwen 3.5 — which should I use?

MLX for best performance. MLX uses 10% less memory than GGUF on Mac and runs 15-30% faster on the same quantization. Use GGUF via Ollama only if you need cross-platform workflows (running the same file on Linux + Mac) or if you want the simplest one-command setup.

What MLX quantization should I pick for Qwen 3.5?

MLX 4-bit for 16 GB+ Macs (minor quality loss, maximum speed). MLX 6-bit for 24 GB+ Macs if you want near-lossless quality. MLX 8-bit for 48 GB+ Macs (Mac Studio tier) — effectively identical to full precision. For fine-tuning with mlx-lm.lora, stick to 4-bit or 8-bit base models.

Where do I download MLX models for Qwen 3.5?

From the mlx-community Hugging Face org: mlx-community/Qwen3.5-9B-MLX-4bit, mlx-community/Qwen3.5-27B-MLX-4bit, mlx-community/Qwen3.5-35B-A3B-MLX-4bit, mlx-community/Qwen3.5-122B-A10B-MLX-4bit. LM Studio's Discover tab lists them with a clear 'MLX' filter. For Python use mlx_lm.generate --model mlx-community/Qwen3.5-*-MLX-4bit.

Does MLX work on M1 Macs?

Yes. MLX runs on all Apple Silicon from M1 onwards. An M1 MacBook Pro 16GB runs Qwen 3.5 4B or Qwen 3.5 9B (tight) at usable speeds. Older M1 chips have 100 GB/s bandwidth vs 546 GB/s on M4 Max — so the relative advantage of larger memory Macs is significant for large models.