What LLM Can I Run Locally on My GPU or Mac? (2026 Calculator + Guide)
Find the best local LLM for your exact VRAM: 4GB, 8GB, 12GB, 16GB, 24GB, 32GB, 48GB, 64GB+. Ranked picks for each tier with quant, tok/s, and one-click calculator.
The question is no longer can you run an LLM locally — it's which one fits your hardware. This guide ranks the best local LLM for every VRAM tier in April 2026 so you can decide in 30 seconds.
Quick decision table
| Your VRAM / RAM | Best local LLM (daily driver) | VRAM Q4 | Alternative |
|---|---|---|---|
| 4-6 GB | Gemma 3 4B | ~2.5 GB | Phi-4-mini 3.8B |
| 8 GB | Qwen 3.5 9B | ~5.1 GB | Llama 3.1 8B |
| 12 GB | Qwen 3 14B | ~8.3 GB | Gemma 3 9B Q8 |
| 16 GB | Qwen 3.5 27B | ~16 GB | DeepSeek Coder V2.5 Lite |
| 24 GB | Qwen 3 30B-A3B MoE | ~17 GB | Qwen 3 Coder 30B-A3B |
| 32 GB | Qwen 3.5 35B-A3B | ~21 GB | Qwen 3 32B dense |
| 48 GB | Qwen 3.5 35B-A3B Q6 | ~28 GB | Llama 4 Scout partial |
| 64-80 GB | Qwen 3.5 122B-A10B | ~68 GB | Llama 4 Maverick |
| 128 GB+ | Qwen 3.5 397B-A17B Q4 | ~222 GB | DeepSeek V3 Q4 partial |
Jump to your tier below, or skip straight to the VRAM Calculator for a personalized answer against your exact hardware.
Popular guides
- Qwen3.6-35B-A3B VRAM Requirements — latest MoE at Q4/Q5/Q6/Q8
- Qwen3.5-35B-A3B VRAM Requirements — sweet spot for 24GB
- Qwen3.6-35B-A3B Hardware Requirements (Buyer Guide) — which GPU or Mac to buy
- MacBook Air M4 vs Pro M4 for Local LLMs — decision guide
- Best Local Coding LLMs for Apple Silicon 24GB
- Ollama Multi-GPU Support — official behavior + performance
- Best Local LLMs for MacBook Air M4 24GB — ranked Top 10
- Best Local LLMs for MacBook Pro M4 Pro 24GB
Why Run LLMs Locally?
Before diving into specs, it's worth being clear on why local inference is worth the setup overhead:
Privacy. Every prompt you send to a cloud API leaves your machine. For code, personal documents, sensitive business data, or anything you'd rather not log with a third party, local inference is the only real answer.
Cost. Cloud APIs charge per token. A busy developer hitting GPT-4 or Claude for code completions all day can rack up $50-200/month without trying. A one-time GPU investment pays for itself quickly.
Speed. Cloud APIs introduce network latency and are subject to rate limits. A local model on a good GPU streams tokens faster than you can read them.
Offline access. Flights, conference halls, client sites with locked-down networks — your local model works anywhere.
The tradeoff is real: local models require hardware, and the most capable frontier models still run best in the cloud. But the gap has closed dramatically, and for most everyday tasks, a well-chosen local model is genuinely competitive.
It All Comes Down to Memory
The single most important number for running LLMs locally is VRAM — the dedicated memory on your GPU. When people say a model "doesn't fit," they almost always mean it doesn't fit in VRAM.
Why VRAM specifically? During inference, the model's weights must be loaded entirely into memory for fast access. A 7B parameter model at Q4 quantization takes roughly 4GB. At Q8, it takes about 7.4GB. Add KV cache for context and you need a bit more headroom on top.
If a model doesn't fit in VRAM, most runtimes fall back to CPU offloading — splitting the model across GPU and system RAM. This works, but it's dramatically slower. For a smooth experience, you want the whole model in VRAM.
A few other factors matter at the margins:
- Memory bandwidth determines tokens-per-second. An RTX 4090 with 1.008 TB/s bandwidth is roughly 2x faster than an RTX 3090 at similar model sizes.
- Compute (TFLOPS) matters less than bandwidth for inference, but becomes important for batch processing or fine-tuning.
- Apple Silicon uses unified memory shared between CPU and GPU. An M4 Max with 64GB can use all 64GB for model weights — a significant advantage over discrete GPUs.
For a deep dive on memory math, see our VRAM requirements guide.
What Can You Run? A Tier-by-Tier Breakdown
4GB VRAM — Entry GPUs (GTX 1650, GTX 1660, RX 6500 XT)
4GB is tight. You're limited to small models at aggressive quantization, but "small" has become more capable than it sounds.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Phi-4-mini (3.8B) | Q4_K_M | ~2.4GB | Microsoft's punchy small model |
| Qwen 3 0.6B | Q8 | ~0.7GB | Tiny but surprisingly coherent |
| Llama 3.2 1B | Q8 | ~1.1GB | Fast, minimal footprint |
| Llama 3.2 3B | Q4_K_M | ~2.0GB | Better quality, still fits |
At 4GB you won't be running serious coding tasks or complex reasoning, but for summarization, simple Q&A, and lightweight chat these models are genuinely usable. Phi-4-mini in particular is a standout — Microsoft trained it to be efficient, and it shows.
Recommendation: Phi-4-mini Q4_K_M. Best capability-per-gigabyte at this tier.
8GB VRAM — Mid-Range (RTX 3060 8GB, RTX 4060, RX 7600)
8GB opens up the mainstream 7-8B model tier, which is where things get genuinely useful for everyday work.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5.0GB | Meta's solid all-rounder |
| Llama 3.2 3B | Q8 | ~3.4GB | Fast with headroom for context |
| Gemma 3 4B | Q8 | ~4.5GB | Google's strong 4B model |
| Phi-4-mini | Q8 | ~4.1GB | Full quality at this tier |
| Mistral 7B | Q4_K_M | ~4.4GB | Fast, good instruction following |
The 7-8B range at Q4 is the sweet spot for 8GB cards. You get models that can write code, draft emails, explain technical concepts, and hold multi-turn conversations — all running locally at 30-60 tokens/second on a modern GPU.
A note on context: with 8GB VRAM and a 5GB model loaded, you have about 3GB left for KV cache. That limits you to roughly 4-8K context depending on the model. If you need long context, you'll need to trade down to a smaller model or upgrade VRAM.
Recommendation: Llama 3.1 8B Q4_K_M for general use. Gemma 3 4B Q8 if you want better quality per token and shorter context is fine.
12GB VRAM — The Sweet Spot (RTX 4070 12GB, RTX 3080 10GB/12GB, RX 7700 XT)
12GB hits a real inflection point. You can now run 7-8B models at full Q8 quality, or push into 13B models at Q4. The RTX 4070 is probably the most popular card in this tier.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Qwen 3 8B | Q8 | ~8.6GB | Excellent reasoning, multilingual |
| Llama 3.1 8B | Q8 | ~8.5GB | Full quality, no compromise |
| Mistral 7B | Q8 | ~7.7GB | Fast inference, good quality |
| DeepSeek R1 7B | Q5_K_M | ~5.6GB | Reasoning specialist |
| Gemma 3 12B | Q4_K_M | ~7.3GB | Google's 12B, fits cleanly |
At 12GB you're no longer making quality tradeoffs for 8B models. Running Qwen 3 8B or Llama 3.1 8B at Q8 gives you near-full model quality. The difference between Q4 and Q8 isn't always dramatic, but for code generation and complex reasoning it's noticeable.
DeepSeek R1 7B deserves a mention here: it's a reasoning model that uses chain-of-thought internally, making it substantially better than its parameter count suggests for math and logic tasks.
Recommendation: Qwen 3 8B Q8 for the best overall quality at this tier. DeepSeek R1 7B Q5_K_M if you do a lot of reasoning/math.
16GB VRAM — Upper Mid-Range (RTX 4070 Ti Super, RTX 5070, RX 7900 GRE)
16GB starts to unlock 13-14B models at useful quantization levels. This tier is a significant step up in quality, especially for coding and analytical tasks.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Qwen 3 14B | Q4_K_M | ~8.9GB | Strong reasoning and coding |
| Phi-4 14B | Q4_K_M | ~8.5GB | Microsoft's flagship 14B |
| Gemma 3 12B | Q8 | ~13.0GB | Full quality 12B |
| Llama 3.1 8B | Q8 | ~8.5GB | Fits with plenty of KV headroom |
| Mistral Nemo 12B | Q5_K_M | ~8.4GB | Long context specialist |
Qwen 3 14B at Q4 is a genuine jump in quality over 8B models — noticeably better at multi-step reasoning, code generation, and complex instructions. Phi-4 14B is Microsoft's strong entry in this class, particularly good for STEM tasks.
With 16GB you also have room for a larger KV cache, which means you can work with longer documents and maintain more context in multi-turn conversations.
Recommendation: Qwen 3 14B Q4_K_M. If you want Microsoft's flavor, Phi-4 14B Q4_K_M is nearly equivalent.
24GB VRAM — Enthusiast Tier (RTX 4090, RTX 3090 Ti, RTX 5090)
24GB is where local inference becomes genuinely impressive. This is the most popular tier for serious local AI users, and for good reason.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Qwen 3 30B-A3B | native | ~19GB | MoE — 30B params, 3B active |
| Mistral Small 24B | Q4_K_M | ~14.2GB | Excellent instruction following |
| DeepSeek R1 32B | Q4_K_M | ~19.6GB | Frontier-quality reasoning |
| Qwen 3 14B | Q8 | ~15.7GB | Full quality 14B |
| Llama 3.3 70B | Q2_K | ~23.0GB | Barely fits, quality degraded |
The headline at 24GB is Qwen 3 30B-A3B. This is a Mixture-of-Experts model that has 30B total parameters but only activates about 3B per token. It fits in ~19GB at native precision and performs like a much larger dense model. For the RTX 4090, this is the recommended daily driver — you get near-frontier quality that would have required a data center two years ago.
DeepSeek R1 32B at Q4_K_M is the other standout. It's a full reasoning model (not distilled) at 32B parameters, and at Q4 it fits in ~19.6GB. For tasks requiring deep reasoning — complex code, math, multi-step analysis — it's remarkable running on consumer hardware.
Recommendation: Qwen 3 30B-A3B for general use. DeepSeek R1 32B Q4_K_M for reasoning-heavy tasks. Compare them head to head.
32GB+ — High-End Consumer (RTX 5090 32GB, Mac M4 Pro 24-48GB)
The RTX 5090 with 32GB VRAM and Apple Silicon Macs with 24-48GB unified memory occupy similar territory, though their architectures are different.
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Llama 4 Scout | Q4_K_M | ~28GB | Meta's latest, fits cleanly |
| DeepSeek R1 32B | Q8 | ~34GB | Full quality reasoning |
| Llama 3.3 70B | Q3_K_M | ~31GB | Usable 70B quality |
| Qwen 3 30B-A3B | Q8 | ~32GB | Full quality MoE |
At 32GB you can run Llama 4 Scout natively, or push into 70B models at lower quantization. Llama 3.3 70B at Q3 is a real tradeoff — quality is noticeably degraded versus Q4+ — but it's usable for tasks where size matters more than precision.
Mac M4 Pro owners with 48GB unified memory have a significant advantage: they can run 70B models at Q4 or better, which is a meaningful quality step up.
48GB+ — Workstation / Mac M4 Max (Mac M4 Max 64GB, RTX A6000, Dual GPU)
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Llama 3.3 70B | Q6_K | ~57GB | Near-full quality 70B |
| Qwen 3 235B-A22B | Q4_K_M | ~148GB | Requires 192GB+ |
| DeepSeek R1 671B | Q2_K | ~178GB | Requires 192GB+ |
| Llama 3.1 70B | Q8 | ~74GB | Fits on 80GB cards |
The Mac M4 Max with 64GB is a standout here. It can run Llama 3.3 70B at Q5_K_M with good quality — something that requires a $10,000+ NVIDIA A100 in the discrete GPU world. Apple Silicon's unified memory architecture means all 64GB is available to the model.
For workstations with A6000 (48GB) or dual-GPU setups, 70B models at high quality become the daily driver.
64GB+ and Beyond — Mac M4 Ultra, Datacenter
| Model | Quant | Est. VRAM | Notes |
|---|---|---|---|
| Qwen 3 235B-A22B | Q4_K_M | ~148GB | Needs 192GB+ |
| DeepSeek R1 671B | Q2_K | ~178GB | With offloading on 128GB |
| Llama 3.1 405B | Q4_K_M | ~243GB | Full frontier scale |
At 128GB+ unified memory (Mac M4 Ultra or above), you can run the largest open-source models with some offloading. DeepSeek R1 671B at Q2_K on a Mac Studio with 192GB unified memory is genuinely possible — though Q2 quantization has real quality tradeoffs, and it won't be fast.
For most people, the 24-64GB tier is where local inference peaks in practical terms. Larger models exist, but the diminishing returns on quality relative to hardware cost become steep.
Best Models by Use Case
Coding
Good code generation requires strong reasoning and a large training corpus of code. Parameter count matters more here than for general chat.
- Limited VRAM (8-12GB): Qwen 3 8B Q5_K_M or Phi-4-mini — both handle common coding tasks well
- Mid tier (16-24GB): Qwen 3 30B-A3B (MoE, punches well above weight for code), DeepSeek Coder V2 Lite
- High end (48GB+): Llama 3.3 70B Q6 or Qwen 3 72B Q4
For coding specifically, MoE architecture models are a great deal: Qwen 3 30B-A3B activates only 3B parameters per token while training on 30B, giving you strong code quality at a fraction of the inference cost.
General Chat and Instruction Following
For everyday conversation, email drafting, summarization, and general assistant tasks, almost any 7B+ model performs well. Prioritize instruction tuning quality.
- 8GB: Llama 3.1 8B Q4_K_M — Meta's instruction tuning is excellent
- 12GB: Mistral 7B Q8 — fast, reliable, very good at following instructions
- 24GB: Mistral Small 24B Q4 — excellent instruction following at this scale
Reasoning and Math
Reasoning tasks (math problems, logic puzzles, multi-step analysis) benefit enormously from models trained with chain-of-thought reasoning. Standard models struggle here regardless of size.
- 12GB: DeepSeek R1 7B Q5_K_M — purpose-built for reasoning, surprisingly strong
- 24GB: DeepSeek R1 32B Q4_K_M — near-frontier reasoning quality locally
- 48GB+: DeepSeek R1 671B Q2 (with offloading) or Llama 3.3 70B with reasoning prompting
The DeepSeek R1 family is the clear choice for reasoning. The distilled variants (7B, 14B, 32B) bring the reasoning capabilities to consumer hardware at a fraction of the full model's size.
Creative Writing
Creative tasks are more forgiving of quantization and model size — a well-prompted 8B model can write excellent fiction. Context window matters more here.
- Any tier: Llama 3.1 8B or Mistral 7B with good system prompts
- 24GB: Mistral Small 24B — notably better at maintaining character voice and narrative coherence
- 48GB+: Llama 3.3 70B — where you start to see frontier-quality prose
For creative writing, consider investing in prompt engineering before chasing larger models. A well-prompted 8B model often outperforms a poorly prompted 70B model on creative tasks.
Understanding Quantization (Quick Version)
Quantization reduces the numerical precision used to store model weights, trading a small amount of quality for dramatically lower memory requirements.
| Format | Bits/Weight | Quality | Use When |
|---|---|---|---|
| Q2_K | ~2.6 | Noticeably degraded | Only when nothing else fits |
| Q3_K_M | ~3.4 | Usable, some degradation | Tight VRAM, large models |
| Q4_K_M | ~4.5 | Good — common sweet spot | Default for most users |
| Q5_K_M | ~5.7 | Very good | When you have headroom |
| Q6_K | ~6.6 | Excellent, near-lossless | High VRAM, max quality |
| Q8_0 | 8.0 | Effectively lossless | When VRAM isn't a constraint |
The practical guidance: Q4_K_M is the default. It's the community standard for good reason — it cuts memory requirements roughly in half versus Q8 with minimal quality impact for most tasks. If you have headroom, step up to Q5_K_M or Q6_K. Only drop below Q4 if a model genuinely won't fit otherwise.
For a full technical breakdown, see our quantization guide.
How to Get Started in 3 Minutes
The easiest way to run local models is Ollama — it handles model downloads, quantization selection, and provides an OpenAI-compatible API.
Install Ollama (Linux/Mac):
curl -fsSL https://ollama.com/install.sh | sh
Run your first model:
# For 8GB VRAM
ollama run llama3.1:8b
# For 12GB VRAM
ollama run qwen3:8b
# For 24GB VRAM
ollama run qwen3:30b-a3b
# For reasoning tasks (12GB+)
ollama run deepseek-r1:7b
Ollama automatically selects an appropriate quantization for your hardware. After the first run, models are cached locally — subsequent starts take seconds.
Use it as an API:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain the difference between RAG and fine-tuning",
"stream": false
}'
This API is compatible with most tools that support OpenAI's format — including Continue.dev for VS Code, Open WebUI, and dozens of others.
Alternatives worth knowing:
- LM Studio — GUI-first, great for non-technical users, excellent model browser
- llama.cpp — the underlying engine, most control, slightly more setup
- Jan — open-source ChatGPT alternative, runs locally
Get a Personalized Recommendation
The VRAM tiers above are useful starting points, but the actual question is what works best for your specific GPU. Factors like memory bandwidth, compute architecture, and available system RAM for offloading all affect real-world performance.
Use our VRAM Calculator to enter your exact hardware and get ranked model recommendations with estimated performance. You can also compare two models side-by-side to see how they differ on your hardware.
The local AI ecosystem moves fast. Models that required 24GB last year now fit in 12GB due to improved quantization techniques and more efficient architectures. If you haven't looked at what's possible in the last six months, you'll likely be pleasantly surprised by what your hardware can handle today.