Will It Run AI
local-ai, llm, vram, getting-started

What LLM Can I Run Locally on My GPU or Mac? (2026 Calculator + Guide)

Find the best local LLM for your exact VRAM: 4GB, 8GB, 12GB, 16GB, 24GB, 32GB, 48GB, 64GB+. Ranked picks for each tier with quant, tok/s, and one-click calculator.

The question is no longer can you run an LLM locally — it's which one fits your hardware. This guide ranks the best local LLM for every VRAM tier in April 2026 so you can decide in 30 seconds.

Quick decision table

Your VRAM / RAMBest local LLM (daily driver)VRAM Q4Alternative
4-6 GBGemma 3 4B~2.5 GBPhi-4-mini 3.8B
8 GBQwen 3.5 9B~5.1 GBLlama 3.1 8B
12 GBQwen 3 14B~8.3 GBGemma 3 9B Q8
16 GBQwen 3.5 27B~16 GBDeepSeek Coder V2.5 Lite
24 GBQwen 3 30B-A3B MoE~17 GBQwen 3 Coder 30B-A3B
32 GBQwen 3.5 35B-A3B~21 GBQwen 3 32B dense
48 GBQwen 3.5 35B-A3B Q6~28 GBLlama 4 Scout partial
64-80 GBQwen 3.5 122B-A10B~68 GBLlama 4 Maverick
128 GB+Qwen 3.5 397B-A17B Q4~222 GBDeepSeek V3 Q4 partial

Jump to your tier below, or skip straight to the VRAM Calculator for a personalized answer against your exact hardware.

Popular guides


Why Run LLMs Locally?

Before diving into specs, it's worth being clear on why local inference is worth the setup overhead:

Privacy. Every prompt you send to a cloud API leaves your machine. For code, personal documents, sensitive business data, or anything you'd rather not log with a third party, local inference is the only real answer.

Cost. Cloud APIs charge per token. A busy developer hitting GPT-4 or Claude for code completions all day can rack up $50-200/month without trying. A one-time GPU investment pays for itself quickly.

Speed. Cloud APIs introduce network latency and are subject to rate limits. A local model on a good GPU streams tokens faster than you can read them.

Offline access. Flights, conference halls, client sites with locked-down networks — your local model works anywhere.

The tradeoff is real: local models require hardware, and the most capable frontier models still run best in the cloud. But the gap has closed dramatically, and for most everyday tasks, a well-chosen local model is genuinely competitive.


It All Comes Down to Memory

The single most important number for running LLMs locally is VRAM — the dedicated memory on your GPU. When people say a model "doesn't fit," they almost always mean it doesn't fit in VRAM.

Why VRAM specifically? During inference, the model's weights must be loaded entirely into memory for fast access. A 7B parameter model at Q4 quantization takes roughly 4GB. At Q8, it takes about 7.4GB. Add KV cache for context and you need a bit more headroom on top.

If a model doesn't fit in VRAM, most runtimes fall back to CPU offloading — splitting the model across GPU and system RAM. This works, but it's dramatically slower. For a smooth experience, you want the whole model in VRAM.

A few other factors matter at the margins:

  • Memory bandwidth determines tokens-per-second. An RTX 4090 with 1.008 TB/s bandwidth is roughly 2x faster than an RTX 3090 at similar model sizes.
  • Compute (TFLOPS) matters less than bandwidth for inference, but becomes important for batch processing or fine-tuning.
  • Apple Silicon uses unified memory shared between CPU and GPU. An M4 Max with 64GB can use all 64GB for model weights — a significant advantage over discrete GPUs.

For a deep dive on memory math, see our VRAM requirements guide.


What Can You Run? A Tier-by-Tier Breakdown

4GB VRAM — Entry GPUs (GTX 1650, GTX 1660, RX 6500 XT)

4GB is tight. You're limited to small models at aggressive quantization, but "small" has become more capable than it sounds.

ModelQuantEst. VRAMNotes
Phi-4-mini (3.8B)Q4_K_M~2.4GBMicrosoft's punchy small model
Qwen 3 0.6BQ8~0.7GBTiny but surprisingly coherent
Llama 3.2 1BQ8~1.1GBFast, minimal footprint
Llama 3.2 3BQ4_K_M~2.0GBBetter quality, still fits

At 4GB you won't be running serious coding tasks or complex reasoning, but for summarization, simple Q&A, and lightweight chat these models are genuinely usable. Phi-4-mini in particular is a standout — Microsoft trained it to be efficient, and it shows.

Recommendation: Phi-4-mini Q4_K_M. Best capability-per-gigabyte at this tier.


8GB VRAM — Mid-Range (RTX 3060 8GB, RTX 4060, RX 7600)

8GB opens up the mainstream 7-8B model tier, which is where things get genuinely useful for everyday work.

ModelQuantEst. VRAMNotes
Llama 3.1 8BQ4_K_M~5.0GBMeta's solid all-rounder
Llama 3.2 3BQ8~3.4GBFast with headroom for context
Gemma 3 4BQ8~4.5GBGoogle's strong 4B model
Phi-4-miniQ8~4.1GBFull quality at this tier
Mistral 7BQ4_K_M~4.4GBFast, good instruction following

The 7-8B range at Q4 is the sweet spot for 8GB cards. You get models that can write code, draft emails, explain technical concepts, and hold multi-turn conversations — all running locally at 30-60 tokens/second on a modern GPU.

A note on context: with 8GB VRAM and a 5GB model loaded, you have about 3GB left for KV cache. That limits you to roughly 4-8K context depending on the model. If you need long context, you'll need to trade down to a smaller model or upgrade VRAM.

Recommendation: Llama 3.1 8B Q4_K_M for general use. Gemma 3 4B Q8 if you want better quality per token and shorter context is fine.


12GB VRAM — The Sweet Spot (RTX 4070 12GB, RTX 3080 10GB/12GB, RX 7700 XT)

12GB hits a real inflection point. You can now run 7-8B models at full Q8 quality, or push into 13B models at Q4. The RTX 4070 is probably the most popular card in this tier.

ModelQuantEst. VRAMNotes
Qwen 3 8BQ8~8.6GBExcellent reasoning, multilingual
Llama 3.1 8BQ8~8.5GBFull quality, no compromise
Mistral 7BQ8~7.7GBFast inference, good quality
DeepSeek R1 7BQ5_K_M~5.6GBReasoning specialist
Gemma 3 12BQ4_K_M~7.3GBGoogle's 12B, fits cleanly

At 12GB you're no longer making quality tradeoffs for 8B models. Running Qwen 3 8B or Llama 3.1 8B at Q8 gives you near-full model quality. The difference between Q4 and Q8 isn't always dramatic, but for code generation and complex reasoning it's noticeable.

DeepSeek R1 7B deserves a mention here: it's a reasoning model that uses chain-of-thought internally, making it substantially better than its parameter count suggests for math and logic tasks.

Recommendation: Qwen 3 8B Q8 for the best overall quality at this tier. DeepSeek R1 7B Q5_K_M if you do a lot of reasoning/math.


16GB VRAM — Upper Mid-Range (RTX 4070 Ti Super, RTX 5070, RX 7900 GRE)

16GB starts to unlock 13-14B models at useful quantization levels. This tier is a significant step up in quality, especially for coding and analytical tasks.

ModelQuantEst. VRAMNotes
Qwen 3 14BQ4_K_M~8.9GBStrong reasoning and coding
Phi-4 14BQ4_K_M~8.5GBMicrosoft's flagship 14B
Gemma 3 12BQ8~13.0GBFull quality 12B
Llama 3.1 8BQ8~8.5GBFits with plenty of KV headroom
Mistral Nemo 12BQ5_K_M~8.4GBLong context specialist

Qwen 3 14B at Q4 is a genuine jump in quality over 8B models — noticeably better at multi-step reasoning, code generation, and complex instructions. Phi-4 14B is Microsoft's strong entry in this class, particularly good for STEM tasks.

With 16GB you also have room for a larger KV cache, which means you can work with longer documents and maintain more context in multi-turn conversations.

Recommendation: Qwen 3 14B Q4_K_M. If you want Microsoft's flavor, Phi-4 14B Q4_K_M is nearly equivalent.


24GB VRAM — Enthusiast Tier (RTX 4090, RTX 3090 Ti, RTX 5090)

24GB is where local inference becomes genuinely impressive. This is the most popular tier for serious local AI users, and for good reason.

ModelQuantEst. VRAMNotes
Qwen 3 30B-A3Bnative~19GBMoE — 30B params, 3B active
Mistral Small 24BQ4_K_M~14.2GBExcellent instruction following
DeepSeek R1 32BQ4_K_M~19.6GBFrontier-quality reasoning
Qwen 3 14BQ8~15.7GBFull quality 14B
Llama 3.3 70BQ2_K~23.0GBBarely fits, quality degraded

The headline at 24GB is Qwen 3 30B-A3B. This is a Mixture-of-Experts model that has 30B total parameters but only activates about 3B per token. It fits in ~19GB at native precision and performs like a much larger dense model. For the RTX 4090, this is the recommended daily driver — you get near-frontier quality that would have required a data center two years ago.

DeepSeek R1 32B at Q4_K_M is the other standout. It's a full reasoning model (not distilled) at 32B parameters, and at Q4 it fits in ~19.6GB. For tasks requiring deep reasoning — complex code, math, multi-step analysis — it's remarkable running on consumer hardware.

Recommendation: Qwen 3 30B-A3B for general use. DeepSeek R1 32B Q4_K_M for reasoning-heavy tasks. Compare them head to head.


32GB+ — High-End Consumer (RTX 5090 32GB, Mac M4 Pro 24-48GB)

The RTX 5090 with 32GB VRAM and Apple Silicon Macs with 24-48GB unified memory occupy similar territory, though their architectures are different.

ModelQuantEst. VRAMNotes
Llama 4 ScoutQ4_K_M~28GBMeta's latest, fits cleanly
DeepSeek R1 32BQ8~34GBFull quality reasoning
Llama 3.3 70BQ3_K_M~31GBUsable 70B quality
Qwen 3 30B-A3BQ8~32GBFull quality MoE

At 32GB you can run Llama 4 Scout natively, or push into 70B models at lower quantization. Llama 3.3 70B at Q3 is a real tradeoff — quality is noticeably degraded versus Q4+ — but it's usable for tasks where size matters more than precision.

Mac M4 Pro owners with 48GB unified memory have a significant advantage: they can run 70B models at Q4 or better, which is a meaningful quality step up.


48GB+ — Workstation / Mac M4 Max (Mac M4 Max 64GB, RTX A6000, Dual GPU)

ModelQuantEst. VRAMNotes
Llama 3.3 70BQ6_K~57GBNear-full quality 70B
Qwen 3 235B-A22BQ4_K_M~148GBRequires 192GB+
DeepSeek R1 671BQ2_K~178GBRequires 192GB+
Llama 3.1 70BQ8~74GBFits on 80GB cards

The Mac M4 Max with 64GB is a standout here. It can run Llama 3.3 70B at Q5_K_M with good quality — something that requires a $10,000+ NVIDIA A100 in the discrete GPU world. Apple Silicon's unified memory architecture means all 64GB is available to the model.

For workstations with A6000 (48GB) or dual-GPU setups, 70B models at high quality become the daily driver.


64GB+ and Beyond — Mac M4 Ultra, Datacenter

ModelQuantEst. VRAMNotes
Qwen 3 235B-A22BQ4_K_M~148GBNeeds 192GB+
DeepSeek R1 671BQ2_K~178GBWith offloading on 128GB
Llama 3.1 405BQ4_K_M~243GBFull frontier scale

At 128GB+ unified memory (Mac M4 Ultra or above), you can run the largest open-source models with some offloading. DeepSeek R1 671B at Q2_K on a Mac Studio with 192GB unified memory is genuinely possible — though Q2 quantization has real quality tradeoffs, and it won't be fast.

For most people, the 24-64GB tier is where local inference peaks in practical terms. Larger models exist, but the diminishing returns on quality relative to hardware cost become steep.


Best Models by Use Case

Coding

Good code generation requires strong reasoning and a large training corpus of code. Parameter count matters more here than for general chat.

  • Limited VRAM (8-12GB): Qwen 3 8B Q5_K_M or Phi-4-mini — both handle common coding tasks well
  • Mid tier (16-24GB): Qwen 3 30B-A3B (MoE, punches well above weight for code), DeepSeek Coder V2 Lite
  • High end (48GB+): Llama 3.3 70B Q6 or Qwen 3 72B Q4

For coding specifically, MoE architecture models are a great deal: Qwen 3 30B-A3B activates only 3B parameters per token while training on 30B, giving you strong code quality at a fraction of the inference cost.

General Chat and Instruction Following

For everyday conversation, email drafting, summarization, and general assistant tasks, almost any 7B+ model performs well. Prioritize instruction tuning quality.

  • 8GB: Llama 3.1 8B Q4_K_M — Meta's instruction tuning is excellent
  • 12GB: Mistral 7B Q8 — fast, reliable, very good at following instructions
  • 24GB: Mistral Small 24B Q4 — excellent instruction following at this scale

Reasoning and Math

Reasoning tasks (math problems, logic puzzles, multi-step analysis) benefit enormously from models trained with chain-of-thought reasoning. Standard models struggle here regardless of size.

  • 12GB: DeepSeek R1 7B Q5_K_M — purpose-built for reasoning, surprisingly strong
  • 24GB: DeepSeek R1 32B Q4_K_M — near-frontier reasoning quality locally
  • 48GB+: DeepSeek R1 671B Q2 (with offloading) or Llama 3.3 70B with reasoning prompting

The DeepSeek R1 family is the clear choice for reasoning. The distilled variants (7B, 14B, 32B) bring the reasoning capabilities to consumer hardware at a fraction of the full model's size.

Creative Writing

Creative tasks are more forgiving of quantization and model size — a well-prompted 8B model can write excellent fiction. Context window matters more here.

  • Any tier: Llama 3.1 8B or Mistral 7B with good system prompts
  • 24GB: Mistral Small 24B — notably better at maintaining character voice and narrative coherence
  • 48GB+: Llama 3.3 70B — where you start to see frontier-quality prose

For creative writing, consider investing in prompt engineering before chasing larger models. A well-prompted 8B model often outperforms a poorly prompted 70B model on creative tasks.


Understanding Quantization (Quick Version)

Quantization reduces the numerical precision used to store model weights, trading a small amount of quality for dramatically lower memory requirements.

FormatBits/WeightQualityUse When
Q2_K~2.6Noticeably degradedOnly when nothing else fits
Q3_K_M~3.4Usable, some degradationTight VRAM, large models
Q4_K_M~4.5Good — common sweet spotDefault for most users
Q5_K_M~5.7Very goodWhen you have headroom
Q6_K~6.6Excellent, near-losslessHigh VRAM, max quality
Q8_08.0Effectively losslessWhen VRAM isn't a constraint

The practical guidance: Q4_K_M is the default. It's the community standard for good reason — it cuts memory requirements roughly in half versus Q8 with minimal quality impact for most tasks. If you have headroom, step up to Q5_K_M or Q6_K. Only drop below Q4 if a model genuinely won't fit otherwise.

For a full technical breakdown, see our quantization guide.


How to Get Started in 3 Minutes

The easiest way to run local models is Ollama — it handles model downloads, quantization selection, and provides an OpenAI-compatible API.

Install Ollama (Linux/Mac):

curl -fsSL https://ollama.com/install.sh | sh

Run your first model:

# For 8GB VRAM
ollama run llama3.1:8b

# For 12GB VRAM
ollama run qwen3:8b

# For 24GB VRAM
ollama run qwen3:30b-a3b

# For reasoning tasks (12GB+)
ollama run deepseek-r1:7b

Ollama automatically selects an appropriate quantization for your hardware. After the first run, models are cached locally — subsequent starts take seconds.

Use it as an API:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain the difference between RAG and fine-tuning",
  "stream": false
}'

This API is compatible with most tools that support OpenAI's format — including Continue.dev for VS Code, Open WebUI, and dozens of others.

Alternatives worth knowing:

  • LM Studio — GUI-first, great for non-technical users, excellent model browser
  • llama.cpp — the underlying engine, most control, slightly more setup
  • Jan — open-source ChatGPT alternative, runs locally

Get a Personalized Recommendation

The VRAM tiers above are useful starting points, but the actual question is what works best for your specific GPU. Factors like memory bandwidth, compute architecture, and available system RAM for offloading all affect real-world performance.

Use our VRAM Calculator to enter your exact hardware and get ranked model recommendations with estimated performance. You can also compare two models side-by-side to see how they differ on your hardware.

The local AI ecosystem moves fast. Models that required 24GB last year now fit in 12GB due to improved quantization techniques and more efficient architectures. If you haven't looked at what's possible in the last six months, you'll likely be pleasantly surprised by what your hardware can handle today.

Frequently Asked Questions

What is the best LLM to run locally?

The best local LLM depends on your hardware. For 8GB VRAM, Llama 3.1 8B or Qwen 3 8B are excellent. For 24GB, Qwen 3 30B-A3B or DeepSeek R1 32B offer near-frontier quality. Use our calculator at willitrunai.com to get personalized recommendations.

Can I run an LLM on 8GB VRAM?

Yes. With 8GB VRAM you can comfortably run 3B-8B parameter models like Llama 3.2 3B (Q8), Gemma 3 4B, or Phi-4-mini. Some 7-8B models fit at Q4 quantization with room for a reasonable context window.

What is the smallest useful LLM?

Models around 3-4B parameters like Phi-4-mini (3.8B) and Llama 3.2 3B are surprisingly capable for their size. They handle basic chat, summarization, and simple coding well, requiring only 2-4GB of VRAM at Q4.

Do I need a GPU to run LLMs locally?

A GPU is strongly recommended for usable speed. CPU-only inference works but is 10-50x slower. Apple Silicon Macs use unified memory which works well. For the best experience, you want at least 8GB of dedicated VRAM.

What is the best free local LLM for coding?

Qwen 3 Coder 30B-A3B is excellent if you have 24GB+ VRAM — it's a MoE model that punches above its weight. For less VRAM, DeepSeek Coder V2 Lite (16B) or Qwen 3 8B work well for coding tasks.

How much VRAM do I need for a 7B model?

A 7B parameter model needs approximately 3.9GB at Q4_K_M, 4.8GB at Q5_K_M, or 7.4GB at Q8 quantization. With KV cache overhead, plan for 5-8GB total depending on quantization and context length.