What is the best local coding LLM for 24GB Apple Silicon?

Qwen 3 Coder 30B-A3B MoE is the top pick for 24GB unified memory in April 2026. It runs at Q4_K_M (~17 GB), delivers ~30-35 tok/s on M4 Pro, and outperforms DeepSeek V2.5 Coder Lite on SWE-bench and real-world multi-file refactors. Second pick: Qwen 3.5 35B-A3B at Q4_K_M.

Can I run Qwen 3 Coder 30B-A3B on MacBook Pro M4 24GB?

Yes — fits at Q4_K_M with 7 GB headroom for KV cache and system RAM. Expect ~30-35 tok/s sustained on the Pro M4 (active cooling), ~25-30 tok/s on the Air M4 with thermal throttling.

Is Qwen 3 Coder 14B better than Llama 3.1 70B at Q4 for coding?

For single-function tasks they are similar. For repo-level refactors, context-aware edits, and agentic coding workflows, Qwen 3 Coder 14B wins — it is fine-tuned specifically on code and handles multi-file reasoning better than general-purpose 70B at tight quantization.

What quantization is best for coding on 24GB Mac?

Prefer Q5_K_M or higher for code generation. Syntax-sensitive outputs benefit from precision. Q4_K_M is acceptable for chat-style coding assistance but expect occasional bracket/quote errors on complex functions.

Does DeepSeek Coder V3 run on 24GB unified memory?

The full DeepSeek Coder V3 (MoE, 128B active) does not fit. DeepSeek Coder V2.5 Lite at Q4_K_M (~14 GB) runs fine and is a strong fallback if you prefer DeepSeek's coding style over Qwen's.

April 22, 2026apple-silicon, coding, local-llm, qwen-coder, deepseek, buyer-guide

Best Coding LLMs for Apple Silicon 24GB — Ranked 2026

Top local coding LLMs for 24GB Apple Silicon (M4 Pro, M3 Pro): Qwen3 Coder 30B, Qwen3.5-35B-A3B, DeepSeek Coder V2.5 ranked by SWE-bench and tok/s.

Best local coding LLMs for 24GB Apple Silicon in 2026 — ranked picks for M4 Pro, M4 Max 36GB, and M3 Pro, with tok/s estimates, recommended quantization, and integration notes for Cursor / Continue.dev / VSCode.

For the ranked model list against your specific hardware, see:

Top coding picks at 24GB unified memory

Rank	Model	VRAM Q4	tok/s (M4 Pro)	Best for
1	Qwen 3 Coder 30B-A3B	~17 GB	~30-35	Overall coding champion; MoE sparsity keeps inference fast
2	Qwen 3.5 35B-A3B	~21 GB	~30	Tight but strong general+coding MoE
3	Qwen 3 Coder 14B	~8 GB	~55	Fastest respectable coding model; perfect for Cursor-style flows
4	Qwen 3.5 27B	~16 GB	~35	Dense alternative; more predictable latency
5	DeepSeek Coder V2.5 Lite	~14 GB	~40	Different style, strong on Python/TS
6	Qwen 3 14B	~8 GB	~50	Not fine-tuned for code but fast and capable
7	Gemma 3 9B	~6 GB	~60	Lightweight fallback; good for quick Q&A

Why Qwen 3 Coder 30B-A3B wins

The MoE architecture (30B total, 3B active per token) gives it the knowledge breadth of a 30B dense model while running at the speed of a 3B dense model. On a 24GB M4 Pro Mac you get:

~17 GB loaded into unified memory
~7 GB headroom for KV cache and macOS/apps
~30-35 tok/s sustained (active-cooled Pro)
Full 262K context without extra memory pressure

For repo-level refactors and agentic workflows (where the model generates multiple tool-calls per turn), this combination is unmatched at 24GB.

When to pick Qwen 3.5 35B-A3B instead

If you want the general-purpose MoE (chat + coding + reasoning), Qwen 3.5 35B-A3B edges out Qwen 3 Coder 30B-A3B on non-code tasks. Coding performance is very close. The cost is ~4 GB more VRAM — on 24GB Macs this means fewer open apps during sessions.

When open weights ship, Qwen3.6-35B-A3B will inherit this slot with the added 1M-context capability for agentic coding.

Quantization: why you want Q5_K_M for code

Code is syntax-sensitive. A missing bracket or quote character due to aggressive quantization destroys the output. Q4_K_M is acceptable for chat-style coding assistance but we have seen reliable quality gains moving to Q5_K_M or Q6_K:

Quant	30B-A3B VRAM	Code quality delta vs FP16
Q4_K_M	~17 GB	-3 to -5% (occasional syntax slips)
Q5_K_M	~20 GB	-1 to -2% (effectively identical for most tasks)
Q6_K	~24 GB	< -1% (near-lossless; won't fit 30B-A3B on 24GB Mac)
Q8_0	~32 GB	No measurable delta (requires 32GB+ Mac)

On a 24GB Mac, stick with Q4_K_M for the 30B-A3B class. If you have a 36GB+ Mac, step up to Q5 or Q6.

Integration with coding tools

All of the picks above expose an OpenAI-compatible endpoint via Ollama or LM Studio, so any tool that speaks OpenAI works.

Ollama (recommended):

ollama pull qwen3-coder:30b-a3b
ollama run qwen3-coder:30b-a3b
# endpoint: http://localhost:11434/v1

LM Studio: Search Qwen3-Coder-30B-A3B-Instruct-GGUF, pick Q4_K_M, start server.

Cursor:

Settings → Models → Add custom model
Base URL: http://localhost:11434/v1
Model: qwen3-coder:30b-a3b

Continue.dev (VSCode):

{
  "models": [
    {
      "title": "Qwen 3 Coder 30B-A3B (local)",
      "provider": "ollama",
      "model": "qwen3-coder:30b-a3b"
    }
  ]
}

MLX vs GGUF on Apple Silicon

MLX (Apple's native ML framework) delivers ~15-25% faster tok/s than llama.cpp GGUF on M-series chips.
GGUF is more mature, has wider tool support (Ollama, LM Studio, Continue.dev out of the box), and the ecosystem is larger.
Recommendation for 2026: Start with GGUF via Ollama for ease of use. If you hit bandwidth limits and want the extra tok/s, switch to MLX with mlx-community models — see our Qwen 3.5 MLX guide.

What about coding on smaller Macs (16 GB)?

If you have a 16 GB Mac, the coding LLM roster is different — see Best AI models for a 16GB Mac for the tailored list. Short version: Qwen 3 Coder 14B at Q4_K_M or Gemma 4 E4B at Q8 are the daily drivers.