Best Reasoning Models to Run Locally in 2026 — VRAM Guide from 2GB to 40GB
Complete guide to reasoning models you can run locally: DeepSeek R1 distills, QwQ 32B, Phi-4 Reasoning, Gemma 4 MoE. VRAM requirements, GPU tiers, and thinking token speed tips.
Reasoning models are the biggest shift in local AI since quantization. Unlike standard LLMs that produce answers in a single pass, reasoning models generate internal chain-of-thought tokens — working through problems step-by-step before delivering a final response. This extended thinking lets them solve math problems, debug code, and handle logic puzzles that stump conventional models.
The tradeoff? They produce far more tokens per query, so VRAM and inference speed matter even more than usual. This guide covers the major reasoning models you can run locally in 2026, with approximate VRAM requirements and GPU recommendations.
What Makes a Reasoning Model Different
Standard LLMs map input to output directly. Reasoning models add an intermediate step: they generate "thinking tokens" that explore the problem space before committing to an answer. You can often see this in the output — a <think> block with the model's internal deliberation, followed by the actual response.
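If you consume raw model output in a script, you usually want the deliberation and the final answer separately. The snippet below is a minimal sketch that assumes the model wraps its reasoning in a literal <think>...</think> pair, as described above; the function name `split_thinking` and the sample string are illustrative, not part of any library.

```python
import re

# Matches one <think>...</think> block; DOTALL lets "." span newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer) extracted from raw model output."""
    thoughts = [chunk.strip() for chunk in THINK_BLOCK.findall(raw)]
    answer = THINK_BLOCK.sub("", raw).strip()
    return "\n".join(thoughts), answer


# Hypothetical output, for illustration only.
sample = "<think>\n17 * 3 = 51, then add 4.\n</think>\nThe result is 55."
thinking, answer = split_thinking(sample)
print("thinking:", thinking)
print("answer:", answer)
```

Output without a <think> block passes through unchanged as the answer, with an empty thinking string.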
This approach — variously called chain-of-thought, extended thinking, or step-by-step reasoning — dramatically improves performance on:
- Math and science (AIME, MATH, GPQA benchmarks)
- Code generation and debugging (LiveCodeBench, HumanEval)
- Logic and planning (multi-step reasoning problems)
- Analysis (comparing options, evaluating tradeoffs)
The improvement is not marginal. On AIME 2025, the DeepSeek R1 32B distill scores 72%, compared with roughly 20-30% for a standard 32B model. Reasoning-focused training transforms what these models can do.
The catch: a single query might generate 500-2000 thinking tokens before producing a 200-token answer. Speed and VRAM are critical.
Reasoning Models Compared — VRAM and Benchmarks
Here is every major reasoning model you can run locally, with VRAM at Q4 quantization and key benchmark scores:
| Model | Params | VRAM (Q4) | AIME | MATH | Speed Factor |
|---|---|---|---|---|---|
| Phi-4 Mini Reasoning 4B | 3.8B | ~2.3 GB | 50% | 82% | Very fast |
| DeepSeek R1 7B | 7B | ~4.0 GB | 55% | 83% | Fast |
| Phi-4 Reasoning Plus 14B | 14B | ~8.0 GB | 75% | 90% | Moderate |
| DeepSeek R1 14B | 14B | ~7.8 GB | 69% | 88% | Moderate |
| Gemma 4 26B-A4B | 25.2B (MoE) | ~15 GB | 89% | 93% | Fast (4B active) |
| QwQ 32B | 32B | ~18 GB | 79% | 95% | Moderate |
| DeepSeek R1 32B | 32B | ~18 GB | 72% | 92% | Moderate |
| Gemma 4 31B | 30.7B | ~18 GB | 82% | 94% | Slow |
| DeepSeek R1 70B | 70B | ~40 GB | 80% | 94% | Slow |
| DeepSeek R1 671B | 671B (MoE) | ~376 GB | 96% | 97% | Multi-GPU only |
Speed Factor is a qualitative estimate of generation speed (tokens per second) on hardware that fits the model. MoE models like Gemma 4 26B-A4B run faster than their total parameter count suggests because only a fraction of parameters activate per token.
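The Q4 figures in the table are consistent with a simple back-of-the-envelope rule: weight memory is roughly parameter count times bits per weight, divided by 8. The sketch below assumes about 4.5 bits per weight for a Q4-class quantization. That value is an assumption chosen because it lines up with the table, not a published constant, and the estimate covers weights only, with no KV cache or runtime overhead. The function name is illustrative.

```python
# Assumption: a Q4-class quantization averages about 4.5 bits per weight.
BITS_PER_WEIGHT_Q4 = 4.5


def weight_vram_gb(params_billion: float,
                   bits_per_weight: float = BITS_PER_WEIGHT_Q4) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params in [("DeepSeek R1 7B", 7), ("QwQ 32B", 32), ("DeepSeek R1 70B", 70)]:
    print(f"{name}: ~{weight_vram_gb(params):.1f} GB for weights at Q4")
```

For MoE models, pass the total parameter count: every expert has to be resident in memory even though only some are active for a given token.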
The Thinking Token Problem
Reasoning models generate far more tokens than standard models for the same query. A typical interaction looks like:
- Prompt: 100 tokens
- Thinking tokens (hidden): 500-2000 tokens
- Visible response: 100-400 tokens
This means total generation is 3-10x what a standard model would produce. A query that takes 5 seconds on a regular 14B model might take 20-40 seconds on a reasoning 14B model — not because it is slower per token, but because it generates many more tokens.
The practical implication: tokens per second matters more for reasoning models than any other model type. A model that runs at 20 tok/s feels responsive; one that runs at 5 tok/s feels painfully slow when generating 1500 thinking tokens.
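To make the wait concrete, divide the total generated tokens by the generation speed. This is a rough sketch that ignores prompt processing time; the 200-token answer length is an illustrative value within the ranges quoted above.

```python
def generation_seconds(thinking_tokens: int, answer_tokens: int,
                       tokens_per_second: float) -> float:
    """Time spent generating, ignoring prompt processing."""
    return (thinking_tokens + answer_tokens) / tokens_per_second


# 1500 thinking tokens plus a 200-token answer at the two speeds discussed.
for speed in (20, 5):
    secs = generation_seconds(1500, 200, speed)
    print(f"{speed} tok/s -> {secs:.0f} s")
```

At 20 tok/s the query finishes in about a minute and a half; at 5 tok/s it takes more than five minutes.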
This is why the "Speed Factor" column in the table above matters. Gemma 4 26B-A4B is a standout here: its MoE architecture means only about 4B parameters are active per token (the "A4B" in the name), so it generates tokens at roughly the speed of a 4B model while delivering 26B-class quality.
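A common rule of thumb explains the MoE advantage: single-user token generation is usually limited by memory bandwidth, because the active weights must be read once per token. The sketch below turns that into an upper bound. It is an approximation under stated assumptions (4.5 bits per weight, a hypothetical 400 GB/s of memory bandwidth) and it ignores KV cache reads, compute, and expert routing overhead, so real speeds will be lower.

```python
def decode_tok_per_s_upper_bound(active_params_billion: float,
                                 mem_bandwidth_gb_s: float,
                                 bits_per_weight: float = 4.5) -> float:
    """Bandwidth-bound ceiling on tokens/second for single-stream decoding."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return mem_bandwidth_gb_s / gb_read_per_token


BANDWIDTH = 400.0  # GB/s, a hypothetical GPU; substitute your card's figure

moe = decode_tok_per_s_upper_bound(4.0, BANDWIDTH)     # ~4B active (26B-A4B)
dense = decode_tok_per_s_upper_bound(30.7, BANDWIDTH)  # dense 31B
print(f"MoE ceiling:   {moe:.0f} tok/s")
print(f"Dense ceiling: {dense:.0f} tok/s")
print(f"Ratio:         {moe / dense:.1f}x")
```

The absolute numbers depend entirely on the assumed bandwidth. The ratio is the useful part: under this model it is simply dense parameters divided by active parameters.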
GPU Recommendations by Tier
8GB VRAM — Entry-Level Reasoning
With 8GB you have genuine reasoning capability, not just toy models.
Best picks:
- DeepSeek R1 7B at Q8 (~7.4GB): full quality, fast generation
- Phi-4 Reasoning Plus 14B at Q4 (~8GB): excellent reasoning, but the weights alone fill the card, so expect short contexts or partial CPU offload
- Phi-4 Mini Reasoning 4B at Q8 (~4.3GB): leaves room for large contexts
Recommended GPUs: RTX 4060 8GB, RTX 3060 8GB, any Mac with 16GB+ unified memory
12GB VRAM — The Sweet Spot for 14B
A 12GB GPU opens up 14B reasoning models at good quantization, which is a meaningful step up in quality.
Best picks:
- Phi-4 Reasoning Plus 14B at Q6 (~11GB) — Near-lossless quality
- DeepSeek R1 14B at Q5 (~10GB) — Strong reasoning, comfortable fit
- DeepSeek R1 7B at Q8 (7.4GB) — Fast with huge context headroom
Recommended GPUs: RTX 4070 12GB, RTX 3060 12GB
16GB VRAM — MoE Reasoning Unlocked
At 16GB, the Gemma 4 26B MoE model becomes available — arguably the best reasoning value in 2026.
Best picks:
- Gemma 4 26B-A4B at Q4 (~15GB) — Frontier reasoning, 4B-speed generation
- Phi-4 Reasoning Plus 14B at Q8 (~15GB) — Maximum 14B quality
- DeepSeek R1 14B at Q8 (~15GB) — Same tier, different flavor
Recommended GPUs: RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, Mac with 32GB unified memory
24GB VRAM — 32B Reasoning Models
The 24GB tier unlocks the most capable single-GPU reasoning models: QwQ 32B and DeepSeek R1 32B.
Best picks:
- QwQ 32B at Q4 (~18GB) — Alibaba's reasoning specialist, excellent at math and code
- DeepSeek R1 32B at Q4 (~18GB) — Distilled from the 671B flagship
- Gemma 4 31B at Q4 (~18GB) — Dense model with configurable reasoning mode
- Gemma 4 26B-A4B at Q6 (~21GB) — Higher quality MoE with fast generation
Recommended GPUs: RTX 4090 24GB, RTX 3090 24GB, RTX 5090 32GB. The standard RTX 5080 ships with 16GB, which places it in the 16GB tier.
Use our hardware calculator to check exact fit for your specific GPU and model combination.
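The tier logic above boils down to a filter over the comparison table. Here is a minimal sketch that copies the Q4 VRAM and AIME figures from that table and lists which models fit a given card. The 1 GB default headroom for context and runtime overhead is an assumption, and `models_that_fit` is an illustrative name, not the calculator's actual logic.

```python
# (name, VRAM in GB at Q4, AIME score in %), copied from the comparison table.
MODELS_Q4 = [
    ("Phi-4 Mini Reasoning 4B", 2.3, 50),
    ("DeepSeek R1 7B", 4.0, 55),
    ("Phi-4 Reasoning Plus 14B", 8.0, 75),
    ("DeepSeek R1 14B", 7.8, 69),
    ("Gemma 4 26B-A4B", 15.0, 89),
    ("QwQ 32B", 18.0, 79),
    ("DeepSeek R1 32B", 18.0, 72),
    ("Gemma 4 31B", 18.0, 82),
    ("DeepSeek R1 70B", 40.0, 80),
]


def models_that_fit(vram_gb: float, headroom_gb: float = 1.0) -> list[str]:
    """Names of models whose Q4 weights fit with headroom, best AIME first."""
    budget = vram_gb - headroom_gb
    fitting = [m for m in MODELS_Q4 if m[1] <= budget]
    fitting.sort(key=lambda m: m[2], reverse=True)
    return [name for name, _, _ in fitting]


for vram in (8, 12, 16, 24):
    print(f"{vram} GB:", ", ".join(models_that_fit(vram)))
```

With the default headroom, the 14B Phi-4 model drops out of the 8GB list, which matches its "tight fit" label; pass `headroom_gb=0` to see what fits on paper only.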
Model Deep Dives
DeepSeek R1 Distills — The Reasoning Standard
The DeepSeek R1 family set the standard for open reasoning models. The distilled variants (7B, 14B, 32B, 70B) are smaller Qwen and Llama base models fine-tuned on reasoning traces generated by the full 671B model, so they inherit R1's reasoning patterns without going through its reinforcement-learning training themselves.
The 32B distill is the local favorite — it balances quality and VRAM better than any other R1 variant. The 70B needs ~40GB at Q4, putting it in multi-GPU or high-memory Mac territory.
QwQ 32B — Alibaba's Reasoning Specialist
QwQ 32B from Alibaba (Qwen team) is a dedicated reasoning model that competes head-to-head with DeepSeek R1 32B. It scores 79% on AIME 2025 and 95% on MATH, making it one of the strongest 32B-class reasoning models available. At ~18GB Q4, it shares the same hardware tier as R1 32B.
Phi-4 Reasoning Plus — Microsoft's Efficient Reasoner
Microsoft's Phi-4 Reasoning Plus 14B punches above its weight. It scores 75% on AIME at roughly 8GB at Q4, less than half the VRAM of a 32B model (~18GB). The smaller Phi-4 Mini Reasoning, at 3.8B parameters, needs only about 2.3GB at Q4, making it the lightest reasoning model in this guide.
Gemma 4 — Configurable Reasoning Mode
Google's Gemma 4 models are not reasoning-first models, but the 26B-A4B and 31B variants support a configurable reasoning mode that enables extended thinking. The 26B MoE variant is particularly interesting: at 15GB Q4 it scores 89% on AIME 2026 while generating tokens at the speed of a 4B model thanks to its Mixture of Experts architecture.
Quantization Matters More for Reasoning
Because reasoning models generate so many tokens per query, small per-token errors compound over the chain of thought. A small quality loss from aggressive quantization (Q2, Q3) can degrade that process, causing the model to lose track of its reasoning or make more errors mid-thought.
Recommendation: stick to Q4_K_M or higher for reasoning models. The VRAM savings from Q2/Q3 are not worth the reasoning quality loss. For a deeper explanation of quantization formats and their tradeoffs, see our quantization guide.
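A toy calculation shows why compounding matters. Suppose each generated token has a small, independent chance of derailing the reasoning. Both error rates below are hypothetical values picked for illustration, not measurements of any quantization level, and real errors are neither independent nor always fatal.

```python
def chance_chain_stays_clean(per_token_error: float, tokens: int) -> float:
    """Probability that no token derails the chain, assuming independence."""
    return (1.0 - per_token_error) ** tokens


# Hypothetical per-token error rates over a 1500-token chain of thought.
for label, err in [("lower error rate", 0.0001), ("10x higher error rate", 0.001)]:
    p = chance_chain_stays_clean(err, 1500)
    print(f"{label}: {p:.0%} chance of a clean chain")
```

Under these assumptions a tenfold increase in per-token error cuts the chance of a clean 1500-token chain from roughly 86% to roughly 22%. The same increase would barely be noticeable in a 100-token answer.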
Quick Setup
All of these models run through standard local inference tools. The fastest path:
```shell
# DeepSeek R1 32B: best all-around reasoning
ollama run deepseek-r1:32b

# QwQ 32B: strong alternative
ollama run qwq:32b

# Phi-4 Mini Reasoning: tiny but capable
ollama run phi4-mini-reasoning

# Gemma 4 26B MoE: fast reasoning via MoE
ollama run gemma4:26b
```
For llama.cpp users, grab GGUF files from Hugging Face and pass the --temp 0.6 flag. Reasoning models benefit from a slightly lower temperature, which keeps the chain of thought focused.
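The same temperature setting can be applied when calling a model from a script. The sketch below builds a request for Ollama's local REST API (`/api/generate` on the default port 11434) using only the standard library. It assumes an Ollama server is already running with the model pulled; `build_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str,
                  temperature: float = 0.6) -> urllib.request.Request:
    """Build a non-streaming generate request with a lowered temperature."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Example call, commented out because it requires a live Ollama server:
# print(ask("deepseek-r1:32b", "What is 17 * 3 + 4?"))
```

Depending on the Ollama version and model, the returned text may still include the thinking block, so expect to strip it before displaying the answer.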
Bottom Line
The reasoning model landscape in 2026 gives you real options at every VRAM tier. If you have 8GB, Phi-4 Reasoning Plus 14B at Q4 delivers genuine extended thinking. If you have 16GB, Gemma 4 26B-A4B is the best reasoning-per-VRAM model available thanks to its MoE architecture. If you have 24GB, QwQ 32B and DeepSeek R1 32B are the gold standard for local reasoning.
The key insight: reasoning models generate 3-10x more tokens than standard models, so prioritize speed. A smaller model running fast often beats a larger model running slow for reasoning tasks. Use our calculator to find the right fit for your hardware.