What is the best LLM to run locally?

The best local LLM depends on your hardware. For 8GB VRAM, Llama 3.1 8B or Qwen 3 8B are excellent. For 24GB, Qwen 3 30B-A3B or DeepSeek R1 32B offer near-frontier quality. Use our calculator at willitrunai.com to get personalized recommendations.

Can I run an LLM on 8GB VRAM?

Yes. With 8GB VRAM you can comfortably run 3B-4B parameter models like Llama 3.2 3B, Gemma 3 4B, or Phi-4-mini at Q8. Most 7-8B models also fit at Q4 quantization with room for a reasonable context window.

What is the smallest useful LLM?

Models around 3-4B parameters like Phi-4-mini (3.8B) and Llama 3.2 3B are surprisingly capable for their size. They handle basic chat, summarization, and simple coding well, requiring only 2-4GB of VRAM at Q4.

Do I need a GPU to run LLMs locally?

A GPU is strongly recommended for usable speed. CPU-only inference works but is 10-50x slower. Apple Silicon Macs use unified memory, which works well for inference. For the best experience on a PC, you want at least 8GB of dedicated VRAM.

What is the best free local LLM for coding?

Qwen 3 Coder 30B-A3B is excellent if you have 24GB+ VRAM — it's a MoE model that punches above its weight. For less VRAM, DeepSeek Coder V2 Lite (16B) or Qwen 3 8B work well for coding tasks.

How much VRAM do I need for a 7B model?

A 7B parameter model needs approximately 3.9GB at Q4_K_M, 4.8GB at Q5_K_M, or 7.4GB at Q8 quantization. With KV cache overhead, plan for 5-8GB total depending on quantization and context length.

March 23, 2026 | Updated April 22, 2026 | local-ai, llm, vram, getting-started

What LLM Can I Run Locally on My GPU or Mac? (2026 Calculator + Guide)

Find the best local LLM for your exact VRAM: 4GB, 8GB, 12GB, 16GB, 24GB, 32GB, 48GB, 64GB+. Ranked picks for each tier with quant, tok/s, and one-click calculator.

The question is no longer whether you can run an LLM locally; it's which one fits your hardware. This guide ranks the best local LLM for every VRAM tier in April 2026 so you can decide in 30 seconds.

Quick decision table

Your VRAM / RAM	Best local LLM (daily driver)	VRAM Q4	Alternative
4-6 GB	Gemma 3 4B	~2.5 GB	Phi-4-mini 3.8B
8 GB	Qwen 3.5 9B	~5.1 GB	Llama 3.1 8B
12 GB	Qwen 3 14B	~8.3 GB	Gemma 2 9B Q8
16 GB	Qwen 3.5 27B	~16 GB	DeepSeek Coder V2 Lite
24 GB	Qwen 3 30B-A3B MoE	~17 GB	Qwen 3 Coder 30B-A3B
32 GB	Qwen 3.5 35B-A3B	~21 GB	Qwen 3 32B dense
48 GB	Qwen 3.5 35B-A3B Q6	~28 GB	Llama 4 Scout partial
64-80 GB	Qwen 3.5 122B-A10B	~68 GB	Llama 4 Maverick
128 GB+	Qwen 3.5 397B-A17B Q4	~222 GB	DeepSeek V3 Q4 partial

Jump to your tier below, or skip straight to the VRAM Calculator for a personalized answer against your exact hardware.

Popular guides

Qwen3.6-35B-A3B VRAM Requirements — latest MoE at Q4/Q5/Q6/Q8
Qwen3.5-35B-A3B VRAM Requirements — sweet spot for 24GB
Qwen3.6-35B-A3B Hardware Requirements (Buyer Guide) — which GPU or Mac to buy
MacBook Air M4 vs Pro M4 for Local LLMs — decision guide
Best Local Coding LLMs for Apple Silicon 24GB
Ollama Multi-GPU Support — official behavior + performance
Best Local LLMs for MacBook Air M4 24GB — ranked Top 10
Best Local LLMs for MacBook Pro M4 Pro 24GB

Why Run LLMs Locally?

Before diving into specs, it's worth being clear on why local inference is worth the setup overhead:

Privacy. Every prompt you send to a cloud API leaves your machine. For code, personal documents, sensitive business data, or anything you'd rather not log with a third party, local inference is the only real answer.

Cost. Cloud APIs charge per token. A busy developer hitting GPT-4 or Claude for code completions all day can rack up $50-200/month without trying. A one-time GPU investment pays for itself quickly.

Speed. Cloud APIs introduce network latency and are subject to rate limits. A local model on a good GPU streams tokens faster than you can read them.

Offline access. Flights, conference halls, client sites with locked-down networks — your local model works anywhere.

The tradeoff is real: local models require hardware, and the most capable frontier models still run best in the cloud. But the gap has closed dramatically, and for most everyday tasks, a well-chosen local model is genuinely competitive.

It All Comes Down to Memory

The single most important number for running LLMs locally is VRAM — the dedicated memory on your GPU. When people say a model "doesn't fit," they almost always mean it doesn't fit in VRAM.

Why VRAM specifically? During inference, the model's weights must be loaded entirely into memory for fast access. A 7B parameter model at Q4 quantization takes roughly 4GB. At Q8, it takes about 7.4GB. Add KV cache for context and you need a bit more headroom on top.
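As a rough sketch of that arithmetic, weight memory is the parameter count times the bits stored per weight, divided by 8. The bits-per-weight values below are approximations for llama.cpp-style quants and the function name is only illustrative; real files differ slightly because some tensors are kept at higher precision.

```python
# Approximate bits stored per weight for a few common formats.
# These are ballpark figures, not exact file sizes.
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Estimate memory for model weights only (no KV cache), in GB (10^9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q8_0", "F16"):
        print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```

The estimate covers weights only, so the KV cache and runtime buffers still need headroom on top.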

If a model doesn't fit in VRAM, most runtimes fall back to CPU offloading — splitting the model across GPU and system RAM. This works, but it's dramatically slower. For a smooth experience, you want the whole model in VRAM.
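To make the split concrete, here is a minimal sketch of how many transformer layers might stay on the GPU when a model is too large. It assumes every layer is the same size and reserves a fixed amount of VRAM for the KV cache and runtime buffers; both are simplifications, and the function name, parameters, and example numbers are hypothetical. In llama.cpp, the result corresponds to the value you would pass to `--n-gpu-layers`.

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many equal-sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical example: a ~19.6 GB model with 64 layers on a 12 GB card.
    on_gpu = layers_on_gpu(model_gb=19.6, n_layers=64, vram_gb=12)
    print(f"{on_gpu} of 64 layers on GPU, {64 - on_gpu} in system RAM")
```

Every layer left in system RAM is read at system memory speed on each token, which is why even a partial spill slows generation noticeably.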

A few other factors matter at the margins:

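Memory bandwidth determines tokens-per-second. An RTX 4090 with 1.008 TB/s of bandwidth is only modestly faster than an RTX 3090 (936 GB/s) at single-user token generation, while both are several times faster than a card in the 250-300 GB/s range.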
Compute (TFLOPS) matters less than bandwidth for inference, but becomes important for batch processing or fine-tuning.
Apple Silicon uses unified memory shared between CPU and GPU. An M4 Max with 64GB can give most of that 64GB to model weights (macOS keeps a share for the system), a significant advantage over discrete GPUs.
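The bandwidth point can be turned into a back-of-envelope ceiling: generating one token means reading roughly every active weight once, so tokens per second cannot exceed bandwidth divided by model size. This is a sketch under that assumption only. It ignores KV cache reads and compute, so real throughput lands well below the number it prints. The 1.008 TB/s figure comes from the text above; the 5 GB model size is illustrative.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed when every token reads all weights once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # 1008 GB/s is the RTX 4090 figure; 5 GB is an 8B-class model at Q4.
    print(f"ceiling: ~{max_tokens_per_second(1008, 5.0):.0f} tok/s")
```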

For a deep dive on memory math, see our VRAM requirements guide.

What Can You Run? A Tier-by-Tier Breakdown

4GB VRAM - Entry GPUs (GTX 1650, RX 6500 XT)

4GB is tight. You're limited to small models at aggressive quantization, but "small" has become more capable than it sounds.

Model	Quant	Est. VRAM	Notes
Phi-4-mini (3.8B)	Q4_K_M	~2.4GB	Microsoft's punchy small model
Qwen 3 0.6B	Q8	~0.7GB	Tiny but surprisingly coherent
Llama 3.2 1B	Q8	~1.1GB	Fast, minimal footprint
Llama 3.2 3B	Q4_K_M	~2.0GB	Better quality, still fits

At 4GB you won't be running serious coding tasks or complex reasoning, but for summarization, simple Q&A, and lightweight chat these models are genuinely usable. Phi-4-mini in particular is a standout — Microsoft trained it to be efficient, and it shows.

Recommendation: Phi-4-mini Q4_K_M. Best capability-per-gigabyte at this tier.

8GB VRAM — Mid-Range (RTX 3060 8GB, RTX 4060, RX 7600)

8GB opens up the mainstream 7-8B model tier, which is where things get genuinely useful for everyday work.

Model	Quant	Est. VRAM	Notes
Llama 3.1 8B	Q4_K_M	~5.0GB	Meta's solid all-rounder
Llama 3.2 3B	Q8	~3.4GB	Fast with headroom for context
Gemma 3 4B	Q8	~4.5GB	Google's strong 4B model
Phi-4-mini	Q8	~4.1GB	Full quality at this tier
Mistral 7B	Q4_K_M	~4.4GB	Fast, good instruction following

The 7-8B range at Q4 is the sweet spot for 8GB cards. You get models that can write code, draft emails, explain technical concepts, and hold multi-turn conversations — all running locally at 30-60 tokens/second on a modern GPU.

A note on context: with 8GB VRAM and a 5GB model loaded, you have about 3GB left for KV cache. That limits you to roughly 4-8K context depending on the model. If you need long context, you'll need to trade down to a smaller model or upgrade VRAM.
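To see where a context limit like this comes from, the KV cache can be estimated from the model's shape: keys and values are stored per layer, with one vector per KV head per token. The sketch below assumes a 16-bit cache and a shape of 32 layers, 8 KV heads, and head dimension 128, which is taken here as an assumption for an 8B-class model with grouped-query attention; check the model card for the model you actually run.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for one sequence."""
    # Factor of 2 covers keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


if __name__ == "__main__":
    # Assumed shape for an 8B-class model with grouped-query attention.
    for ctx in (4096, 8192, 16384):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"{ctx:>6} tokens: {gib:.2f} GiB")
```

Runtime compute buffers, the desktop, and other applications also take VRAM, and models without grouped-query attention need several times more cache per token, so practical limits sit below what the raw remainder suggests.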

Recommendation: Llama 3.1 8B Q4_K_M for general use. Gemma 3 4B Q8 if you prefer a smaller model at near-lossless precision and want more VRAM left over for context.

12GB VRAM - The Sweet Spot (RTX 4070 12GB, RTX 3080 12GB, RX 7700 XT)

12GB hits a real inflection point. You can now run 7-8B models at full Q8 quality, or push into 13B models at Q4. The RTX 4070 is probably the most popular card in this tier.

Model	Quant	Est. VRAM	Notes
Qwen 3 8B	Q8	~8.6GB	Excellent reasoning, multilingual
Llama 3.1 8B	Q8	~8.5GB	Full quality, no compromise
Mistral 7B	Q8	~7.7GB	Fast inference, good quality
DeepSeek R1 7B	Q5_K_M	~5.6GB	Reasoning specialist
Gemma 3 12B	Q4_K_M	~7.3GB	Google's 12B, fits cleanly

At 12GB you're no longer making quality tradeoffs for 8B models. Running Qwen 3 8B or Llama 3.1 8B at Q8 gives you near-full model quality. The difference between Q4 and Q8 isn't always dramatic, but for code generation and complex reasoning it's noticeable.

DeepSeek R1 7B deserves a mention here: it's a reasoning model that uses chain-of-thought internally, making it substantially better than its parameter count suggests for math and logic tasks.

Recommendation: Qwen 3 8B Q8 for the best overall quality at this tier. DeepSeek R1 7B Q5_K_M if you do a lot of reasoning/math.

16GB VRAM - Upper Mid-Range (RTX 4070 Ti Super, RTX 5070 Ti, RX 7900 GRE)

16GB starts to unlock 13-14B models at useful quantization levels. This tier is a significant step up in quality, especially for coding and analytical tasks.

Model	Quant	Est. VRAM	Notes
Qwen 3 14B	Q4_K_M	~8.9GB	Strong reasoning and coding
Phi-4 14B	Q4_K_M	~8.5GB	Microsoft's flagship 14B
Gemma 3 12B	Q8	~13.0GB	Full quality 12B
Llama 3.1 8B	Q8	~8.5GB	Fits with plenty of KV headroom
Mistral Nemo 12B	Q5_K_M	~8.4GB	Long context specialist

Qwen 3 14B at Q4 is a genuine jump in quality over 8B models — noticeably better at multi-step reasoning, code generation, and complex instructions. Phi-4 14B is Microsoft's strong entry in this class, particularly good for STEM tasks.

With 16GB you also have room for a larger KV cache, which means you can work with longer documents and maintain more context in multi-turn conversations.

Recommendation: Qwen 3 14B Q4_K_M. If you want Microsoft's flavor, Phi-4 14B Q4_K_M is nearly equivalent.

24GB VRAM - Enthusiast Tier (RTX 4090, RTX 3090, RTX 3090 Ti)

24GB is where local inference becomes genuinely impressive. This is the most popular tier for serious local AI users, and for good reason.

Model	Quant	Est. VRAM	Notes
Qwen 3 30B-A3B	Q4_K_M	~19GB	MoE: 30B params, 3B active
Mistral Small 24B	Q4_K_M	~14.2GB	Excellent instruction following
DeepSeek R1 32B	Q4_K_M	~19.6GB	Frontier-quality reasoning
Qwen 3 14B	Q8	~15.7GB	Full quality 14B
Llama 3.3 70B	Q2_K	~23.0GB	Barely fits, quality degraded

The headline at 24GB is Qwen 3 30B-A3B. This is a Mixture-of-Experts model that has 30B total parameters but only activates about 3B per token. It fits in ~19GB at Q4_K_M (unquantized 16-bit weights would need roughly 60GB) and performs like a much larger dense model than its active parameter count suggests. For the RTX 4090, this is the recommended daily driver: you get near-frontier quality that would have required a data center two years ago.

DeepSeek R1 32B at Q4_K_M is the other standout. It's a 32B distillation of the full DeepSeek R1 reasoning model, and at Q4 it fits in ~19.6GB. For tasks requiring deep reasoning (complex code, math, multi-step analysis) it's remarkable running on consumer hardware.

Recommendation: Qwen 3 30B-A3B for general use. DeepSeek R1 32B Q4_K_M for reasoning-heavy tasks. Compare them head to head.

32GB+ — High-End Consumer (RTX 5090 32GB, Mac M4 Pro 24-48GB)

The RTX 5090 with 32GB VRAM and Apple Silicon Macs with 24-48GB unified memory occupy similar territory, though their architectures are different.

Model	Quant	Est. VRAM	Notes
Llama 4 Scout	Q4_K_M	~61GB	109B-parameter MoE; needs partial CPU offload
DeepSeek R1 32B	Q8	~34GB	Full quality reasoning
Llama 3.3 70B	Q3_K_M	~31GB	Usable 70B quality
Qwen 3 30B-A3B	Q8	~32GB	Full quality MoE

At 32GB you can push into 70B models at lower quantization. Llama 3.3 70B at Q3 is a real tradeoff: quality is noticeably degraded versus Q4+, but it's usable for tasks where size matters more than precision. Llama 4 Scout is a 109B-parameter MoE, so its Q4 weights come to roughly 60GB and it runs at this tier only with part of the model offloaded to system RAM.

Mac M4 Pro owners with 48GB unified memory have a significant advantage: they can run 70B models at Q4, which is a meaningful quality step up, though the fit is tight once macOS and the KV cache take their share.

48GB+ — Workstation / Mac M4 Max (Mac M4 Max 64GB, RTX A6000, Dual GPU)

Model	Quant	Est. VRAM	Notes
Llama 3.3 70B	Q6_K	~57GB	Near-full quality 70B
Qwen 3 235B-A22B	Q4_K_M	~148GB	Requires 192GB+
DeepSeek R1 671B	Q2_K	~178GB	Requires 192GB+
Llama 3.1 70B	Q8	~74GB	Fits on 80GB cards

The Mac M4 Max with 64GB is a standout here. It can run Llama 3.3 70B at Q5_K_M with good quality, something that requires a $10,000+ NVIDIA A100 in the discrete GPU world. Apple Silicon's unified memory architecture makes most of the 64GB available to the model, with the remainder reserved for macOS.

For workstations with A6000 (48GB) or dual-GPU setups, 70B models at high quality become the daily driver.

64GB+ and Beyond — Mac M4 Ultra, Datacenter

Model	Quant	Est. VRAM	Notes
Qwen 3 235B-A22B	Q4_K_M	~148GB	Needs 192GB+
DeepSeek R1 671B	Q2_K	~178GB	With offloading on 128GB
Llama 3.1 405B	Q4_K_M	~243GB	Full frontier scale

At 128GB+ unified memory (Mac M4 Ultra or above), you can run the largest open-source models with some offloading. DeepSeek R1 671B at Q2_K on a Mac Studio with 192GB unified memory is genuinely possible — though Q2 quantization has real quality tradeoffs, and it won't be fast.

For most people, the 24-64GB tier is where local inference peaks in practical terms. Larger models exist, but the diminishing returns on quality relative to hardware cost become steep.

Best Models by Use Case

Coding

Good code generation requires strong reasoning and a large training corpus of code. Parameter count matters more here than for general chat.

Limited VRAM (8-12GB): Qwen 3 8B Q5_K_M or Phi-4-mini — both handle common coding tasks well
Mid tier (16-24GB): Qwen 3 30B-A3B (MoE, punches well above weight for code), DeepSeek Coder V2 Lite
High end (48GB+): Llama 3.3 70B Q6 or Qwen 2.5 72B Q4

For coding specifically, MoE models are a great deal: Qwen 3 30B-A3B activates only about 3B of its 30B parameters per token, so it generates at roughly the speed of a small model while drawing on the capacity of a much larger one. The catch is memory: all 30B parameters still have to be loaded.
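The MoE tradeoff can be put in numbers with a small sketch. It assumes memory scales with total parameters while per-token work scales with active parameters, and it uses about 4.5 bits per weight for Q4. The class and function names are hypothetical, and shared layers, routing overhead, and the KV cache are ignored.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    name: str
    total_params_b: float   # all parameters, in billions (must fit in memory)
    active_params_b: float  # parameters used per token, in billions


def footprint_and_per_token_gb(m: ModelShape, bits_per_weight: float = 4.5):
    """Return (GB to hold the weights, GB of weights read per generated token)."""
    gb_per_billion = bits_per_weight / 8
    return m.total_params_b * gb_per_billion, m.active_params_b * gb_per_billion


if __name__ == "__main__":
    for m in (ModelShape("30B-A3B MoE", 30, 3), ModelShape("30B dense", 30, 30)):
        held, read = footprint_and_per_token_gb(m)
        print(f"{m.name}: holds ~{held:.1f} GB, reads ~{read:.1f} GB per token")
```

Both models occupy the same memory, but the MoE touches about a tenth of it per token, which is where its speed advantage comes from.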

General Chat and Instruction Following

For everyday conversation, email drafting, summarization, and general assistant tasks, almost any 7B+ model performs well. Prioritize instruction tuning quality.

8GB: Llama 3.1 8B Q4_K_M — Meta's instruction tuning is excellent
12GB: Mistral 7B Q8 — fast, reliable, very good at following instructions
24GB: Mistral Small 24B Q4 — excellent instruction following at this scale

Reasoning and Math

Reasoning tasks (math problems, logic puzzles, multi-step analysis) benefit enormously from models trained with chain-of-thought reasoning. Standard instruction-tuned models of the same size tend to lag well behind on these tasks.

12GB: DeepSeek R1 7B Q5_K_M — purpose-built for reasoning, surprisingly strong
24GB: DeepSeek R1 32B Q4_K_M — near-frontier reasoning quality locally
48GB+: DeepSeek R1 671B Q2 (with offloading) or Llama 3.3 70B with reasoning prompting

The DeepSeek R1 family is the clear choice for reasoning. The distilled variants (7B, 14B, 32B) bring the reasoning capabilities to consumer hardware at a fraction of the full model's size.

Creative Writing

Creative tasks are more forgiving of quantization and model size — a well-prompted 8B model can write excellent fiction. Context window matters more here.

Any tier: Llama 3.1 8B or Mistral 7B with good system prompts
24GB: Mistral Small 24B — notably better at maintaining character voice and narrative coherence
48GB+: Llama 3.3 70B — where you start to see frontier-quality prose

For creative writing, consider investing in prompt engineering before chasing larger models. A well-prompted 8B model often outperforms a poorly prompted 70B model on creative tasks.

Understanding Quantization (Quick Version)

Quantization reduces the numerical precision used to store model weights, trading a small amount of quality for dramatically lower memory requirements.

Format	Bits/Weight	Quality	Use When
Q2_K	~2.6	Noticeably degraded	Only when nothing else fits
Q3_K_M	~3.4	Usable, some degradation	Tight VRAM, large models
Q4_K_M	~4.5	Good — common sweet spot	Default for most users
Q5_K_M	~5.7	Very good	When you have headroom
Q6_K	~6.6	Excellent, near-lossless	High VRAM, max quality
Q8_0	8.5	Effectively lossless	When VRAM isn't a constraint

The practical guidance: Q4_K_M is the default. It's the community standard for good reason — it cuts memory requirements roughly in half versus Q8 with minimal quality impact for most tasks. If you have headroom, step up to Q5_K_M or Q6_K. Only drop below Q4 if a model genuinely won't fit otherwise.
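That guidance can be sketched as a simple search: walk the formats from highest quality down and stop at the first one whose weights fit. The bits-per-weight values are approximations, with Q8_0 taken as 8.5 (8-bit values plus one 16-bit scale per block of 32 weights). The 1.5 GB headroom for KV cache and buffers is an assumption you should tune, and the function name is illustrative.

```python
# Approximate bits per weight, highest quality first.
QUANTS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.5),
    ("Q3_K_M", 3.4),
    ("Q2_K", 2.6),
]


def best_quant_that_fits(params_billions, vram_gb, headroom_gb=1.5):
    """Return the highest-quality quant whose weights fit, or None."""
    budget_gb = vram_gb - headroom_gb
    for name, bits in QUANTS:
        if params_billions * bits / 8 <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (8, 12, 24):
        print(f"14B on {vram} GB: {best_quant_that_fits(14, vram)}")
```

A result below Q4_K_M is usually a sign that a smaller model at Q4 or higher would serve better than the larger one squeezed in.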

For a full technical breakdown, see our quantization guide.

How to Get Started in 3 Minutes

The easiest way to run local models is Ollama. It handles model downloads and default quantization choices, and it provides an OpenAI-compatible API.

Install Ollama (Linux/Mac):

curl -fsSL https://ollama.com/install.sh | sh

Run your first model:

# For 8GB VRAM
ollama run llama3.1:8b

# For 12GB VRAM
ollama run qwen3:8b

# For 24GB VRAM
ollama run qwen3:30b-a3b

# For reasoning tasks (12GB+)
ollama run deepseek-r1:7b

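Each Ollama tag points to a fixed default quantization (typically a 4-bit quant) rather than adapting to your hardware; other quantizations are published as separate tags on the model's library page. After the first run, models are cached locally, so subsequent starts take seconds.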

Use it as an API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain the difference between RAG and fine-tuning",
  "stream": false
}'

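The endpoint above is Ollama's native API. Ollama also serves an OpenAI-compatible API under http://localhost:11434/v1, which is what tools built around OpenAI's format expect. Continue.dev for VS Code, Open WebUI, and dozens of other tools can connect to a local Ollama server.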
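The /api/generate request shown above can also be sent from a script. This is a minimal sketch using only Python's standard library. It assumes Ollama is listening on its default port as in the curl example, and that the non-streaming reply carries the generated text in a `response` field. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the curl example


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the non-streaming reply is one JSON object with a "response" field.
    return body["response"]


# Usage (requires a running Ollama server with the model already pulled):
# print(generate("llama3.1:8b", "Explain the difference between RAG and fine-tuning"))
```

Building the request separately from sending it keeps the payload easy to inspect or log before anything touches the network.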

Alternatives worth knowing:

LM Studio — GUI-first, great for non-technical users, excellent model browser
llama.cpp — the underlying engine, most control, slightly more setup
Jan — open-source ChatGPT alternative, runs locally

Get a Personalized Recommendation

The VRAM tiers above are useful starting points, but the actual question is what works best for your specific GPU. Factors like memory bandwidth, compute architecture, and available system RAM for offloading all affect real-world performance.

Use our VRAM Calculator to enter your exact hardware and get ranked model recommendations with estimated performance. You can also compare two models side-by-side to see how they differ on your hardware.

The local AI ecosystem moves fast. Output quality that needed a 24GB card last year is often within reach of a 12GB card today, thanks to improved quantization techniques and more efficient architectures. If you haven't looked at what's possible in the last six months, you'll likely be pleasantly surprised by what your hardware can handle today.