How much memory does Qwen 3.5 35B-A3B use in MLX?

Qwen 3.5 35B-A3B MLX 4-bit peaks at ~19.5 GB of unified memory during inference (weights 19.5 GB + KV cache ~1-3 GB depending on context). The VLM variant peaks at ~20.5 GB. This fits comfortably on M4 Max 36GB and above. On M4 Pro 24GB it is marginal — close Chrome and other memory-hungry apps before loading.

What is the fastest Mac for Qwen 3.5?

M3 Ultra 512GB is the fastest Mac for Qwen 3.5 — it reaches 80+ tok/s on Qwen 3.5 35B-A3B at MLX 8-bit thanks to 800 GB/s unified memory bandwidth. For consumer Macs, M4 Max 64GB delivers ~55-70 tok/s on the same model. MacBook Air M4 16GB runs the 9B variant at 25-35 tok/s — real interactive speed.

Can MacBook Air M4 16GB run Qwen 3.5?

Yes. MacBook Air M4 16GB runs Qwen 3.5 9B at MLX 4-bit (5.5 GB model + 1-2 GB cache) at 25-35 tokens/second. You will need to close heavy apps (Chrome, Docker) to leave headroom. For Qwen 3.5 27B or the 35B-A3B MoE, you need 24GB+ unified memory.

MLX vs GGUF on Mac for Qwen 3.5 — which should I use?

MLX for best performance. MLX uses 10% less memory than GGUF on Mac and runs 15-30% faster on the same quantization. Use GGUF via Ollama only if you need cross-platform workflows (running the same file on Linux + Mac) or if you want the simplest one-command setup.

What MLX quantization should I pick for Qwen 3.5?

MLX 4-bit for 16 GB+ Macs (minor quality loss, maximum speed). MLX 6-bit for 24 GB+ Macs if you want near-lossless quality. MLX 8-bit for 48 GB+ Macs (Mac Studio tier) — effectively identical to full precision. For fine-tuning with mlx-lm.lora, stick to 4-bit or 8-bit base models.

Where do I download MLX models for Qwen 3.5?

From the mlx-community Hugging Face org: mlx-community/Qwen3.5-9B-MLX-4bit, mlx-community/Qwen3.5-27B-MLX-4bit, mlx-community/Qwen3.5-35B-A3B-MLX-4bit, mlx-community/Qwen3.5-122B-A10B-MLX-4bit. LM Studio's Discover tab lists them with a clear 'MLX' filter. For Python use mlx_lm.generate --model mlx-community/Qwen3.5-*-MLX-4bit.

Does MLX work on M1 Macs?

Yes. MLX runs on all Apple Silicon from M1 onwards. An M1 MacBook Pro 16GB runs Qwen 3.5 4B or Qwen 3.5 9B (tight) at usable speeds. Older M1 chips have 100 GB/s bandwidth vs 546 GB/s on M4 Max — so the relative advantage of larger memory Macs is significant for large models.

April 20, 2026qwen, qwen-3-5, mlx, apple-silicon, macbook, mac-studio, unified-memory

Qwen 3.5 on Apple Silicon MLX (2026) — Memory Usage, Tokens/sec & Setup Guide

Qwen 3.5 35B-A3B MLX 4-bit uses ~19.5 GB on Mac. 9B fits 16 GB Macs at 25-35 tok/s. Full MLX memory + speed benchmarks per variant from M4 16GB to M3 Ultra 512GB.

This guide covers running Qwen 3.5 specifically on Apple Silicon via MLX, with exact memory numbers per variant and benchmarks from MacBook Air M4 16GB through Mac Studio M3 Ultra 512GB. For the generic MLX-vs-Ollama framework comparison, see MLX vs Ollama on Apple Silicon.

Quick answers — which Qwen 3.5 for which Mac

Mac	Best Qwen 3.5 variant	MLX quant	Peak memory	Tok/s
M4 16GB MacBook Air	9B	4-bit	~7 GB	25-35
M4 Pro 24GB	9B Q8 or 27B 4-bit	8-bit / 4-bit	~10 / ~18 GB	30-40 / 18-25
M4 Max 36GB	35B-A3B 4-bit	4-bit	~21 GB	40-55
M4 Max 64GB	35B-A3B 6-bit	6-bit	~29 GB	50-65
M4 Max 128GB	122B-A10B 4-bit	4-bit	~75 GB	20-30
M4 Ultra 192GB	122B-A10B 5-bit	5-bit	~88 GB	30-45
M3 Ultra 256GB	122B-A10B 8-bit	8-bit	~131 GB	40-55
M3 Ultra 512GB	122B-A10B FP16 or 35B-A3B 8-bit	8-bit / FP16	flexible	50-80

Memory usage — MLX quantizations per variant

These are peak unified memory usage numbers during inference (model weights + KV cache at default 4K context). Add ~1 GB per 8K additional context.

Qwen 3.5 9B dense

MLX quant	Peak memory	Quality	Notes
4-bit	~6.0 GB	Minor loss	Fits on 16 GB Macs with room for apps
6-bit	~7.9 GB	Near-lossless	Comfortable on 16 GB+ Macs
8-bit	~10.4 GB	Effectively FP16	Needs 24 GB+ for comfortable context
FP16	~20.3 GB	Reference	Needs 32 GB+ Macs (M4 Max tier)

Qwen 3.5 27B dense

MLX quant	Peak memory	Quality	Notes
4-bit	~16.8 GB	Minor loss	Tight on 24 GB Macs; comfortable on 36 GB+
6-bit	~22.5 GB	Near-lossless	36 GB+ recommended
8-bit	~29.2 GB	Effectively FP16	48 GB+ (Mac Studio tier)
FP16	~56.2 GB	Reference	64 GB+

Qwen 3.5 35B-A3B MoE

MLX quant	Peak memory	Quality	Notes
4-bit	~19.5-20.5 GB	Minor loss	Fits M4 Max 36GB comfortably
4-bit (VLM variant)	~20.5-22.0 GB	Minor loss	Includes vision encoder
6-bit	~28.9 GB	Near-lossless	M4 Max 36GB comfortable, 64GB preferred
8-bit	~38.0 GB	Effectively FP16	64 GB+
Extreme 2-bit (community)	~11.3-12.5 GB	Major loss but usable	Fits on M4 16GB (tight)

Community note: one MLX build ("Qwen3.5-35B-A3B-Alis-MLX-Dynamic-2.6bpw-VLM") uses extreme dynamic quantization (Expert 2-bit + Attention/GDN 3-bit) to fit 35B-A3B into 16 GB. Quality drops meaningfully but inference works at ~5-6 tok/s on M4 16GB.

Qwen 3.5 122B-A10B MoE

MLX quant	Peak memory	Quality	Notes
4-bit	~75 GB	Minor loss	Fits Mac Studio M4 Max 128GB
5-bit	~89 GB	Near-lossless	Mac Studio Ultra 192GB+
6-bit	~102 GB	Near-lossless	Mac Studio Ultra 192GB+
8-bit	~132 GB	Effectively FP16	M3 Ultra 256GB+
FP16	~252 GB	Reference	M3 Ultra 512GB

Tokens/second — per Mac and variant

Numbers triangulated from r/LocalLLaMA community reports, Ollama 0.19 release benchmarks, and Apple Silicon LLM inference guides. Decode-phase throughput at default context.

Qwen 3.5 9B at MLX 4-bit

Mac	Tok/s	Bandwidth limited?
M1 MacBook Pro 16GB	~15-20	Compute-limited
M2 16GB	~18-24	Compute-limited
M3 16GB	~22-28	Balanced
M4 16GB	~25-35	Balanced
M4 Pro 24GB	~32-42	Bandwidth-limited
M4 Max 36GB	~60-80	Balanced
M4 Max 64GB	~62-82	Balanced
M3 Ultra 512GB	~120-160	Balanced

Qwen 3.5 27B at MLX 4-bit

Mac	Tok/s
M4 Pro 24GB	~15-22 (tight fit)
M4 Max 36GB	~30-42
M4 Max 64GB	~35-48
M4 Max 128GB	~38-50
M3 Ultra 512GB	~65-85

Qwen 3.5 35B-A3B at MLX 4-bit

Mac	Tok/s
M4 Pro 24GB	~18-25 (tight)
M4 Max 36GB	~40-55
M4 Max 64GB	~55-70
M4 Max 128GB	~58-75
M3 Ultra 512GB	~80-110

Qwen 3.5 122B-A10B at MLX 4-bit

Mac	Tok/s
M4 Max 128GB	~20-30
M4 Ultra 192GB	~30-45
M3 Ultra 256GB	~40-55
M3 Ultra 512GB	~50-65

Setup — mlx-lm (CLI)

# Install
pip install mlx-lm

# Run Qwen 3.5 9B MLX 4-bit
mlx_lm.generate \
  --model mlx-community/Qwen3.5-9B-MLX-4bit \
  --prompt "Write a concise summary of Mixture of Experts." \
  --max-tokens 512

# Run the 35B-A3B MoE
mlx_lm.generate \
  --model mlx-community/Qwen3.5-35B-A3B-MLX-4bit \
  --prompt "Implement a Python function to batch API requests." \
  --max-tokens 1024

# Python API
python -c "
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Qwen3.5-9B-MLX-4bit')
print(generate(model, tokenizer, prompt='Hello!', max_tokens=100))
"

Setup — LM Studio (GUI)

Install LM Studio from lmstudio.ai.
Open the Discover tab.
Search "Qwen 3.5".
Filter results by "MLX" (top right filter button).
Download the quantization matching your Mac tier (see table above).
Load in the Chat tab or expose via Developer tab → Local Server.

LM Studio's OpenAI-compatible server on port 1234 makes it the easiest way to plug MLX Qwen 3.5 into existing agent frameworks (LangChain, CrewAI, LlamaIndex).

Fine-tuning Qwen 3.5 with MLX

MLX natively supports LoRA and QLoRA fine-tuning on Apple Silicon. Example for Qwen 3.5 9B:

# Convert HF model to MLX format (if not using a pre-quantized version)
mlx_lm.convert \
  --hf-path Qwen/Qwen3.5-9B \
  --mlx-path mlx_models/qwen3.5-9b \
  --quantize

# Fine-tune with LoRA
mlx_lm.lora \
  --model mlx_models/qwen3.5-9b \
  --train \
  --data data/ \
  --lora-layers 16 \
  --batch-size 2 \
  --iters 600

Practical: a LoRA fine-tune of Qwen 3.5 9B on a M4 Max 64GB takes ~2 hours for 600 iterations with a small domain dataset. Full fine-tuning is not feasible on consumer Macs — even the 9B base would need 40+ GB for optimizer states.

Troubleshooting

Metal shader crash on macOS 26 + Ollama 0.18.2: known bug affecting M4/M5 chips. Fix: upgrade to Ollama 0.19 or switch to MLX (unaffected).
OOM on M4 Pro 24GB running 27B or 35B-A3B: close Chrome, Docker Desktop, IDE, and Spotlight heavy apps. macOS reserves ~3.5 GB. You have ~20.5 GB usable.
MLX is slower than expected: verify you downloaded the MLX variant, not the GGUF. MLX models are named mlx-community/Qwen3.5-*-MLX-4bit, not unsloth/Qwen3.5-*-GGUF.
Context window collapses speed: at 32K+ context the KV cache overhead on a 24 GB Mac forces memory pressure. Drop to 8K-16K for interactive work.
Download errors on mlx-community models: some have been renamed. Check the mlx-community HF org for the latest naming.