How much VRAM do I need to run a local LLM?

Rule of thumb: weights take (model parameters × bytes per weight), and the KV cache and activations add another 15-25% on top. A 7B model at Q4 (~0.5 bytes/param) needs ~4.5 GB; 13B Q4 needs ~8 GB; 30B Q4 needs ~18 GB; 70B Q4 needs ~40 GB. Use the calculator on this page for an exact number per model, runtime, and context length.
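The rule of thumb above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual fit engine: the function name `estimate_vram_gb` and the 20% default overhead (the midpoint of the 15-25% range) are assumptions made for the example.

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_weight: float,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate: weights plus a fractional overhead.

    overhead is an assumed fraction (0.15-0.25 in the rule of thumb)
    covering the KV cache and activations.
    """
    if params_billion <= 0 or bytes_per_weight <= 0:
        raise ValueError("params_billion and bytes_per_weight must be positive")
    if overhead < 0:
        raise ValueError("overhead must not be negative")
    # 1e9 params * bytes per param / 1e9 bytes per GB = GB
    weights_gb = params_billion * bytes_per_weight
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    # Q4 is taken as ~0.5 bytes per weight, as in the rule of thumb.
    for size in (7, 13, 30, 70):
        print(f"{size}B at Q4: ~{estimate_vram_gb(size, 0.5):.1f} GB")
```

The results land close to the rounded figures quoted above; real usage shifts with the exact quantization format, runtime, and context length.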

Which quantization format should I pick (Q4, Q5, Q8, FP16)?

Q4_K_M is the default sweet spot: roughly 70% VRAM savings vs FP16 (about 0.5-0.6 bytes per weight instead of 2) with <2% quality loss on most benchmarks. Pick Q5_K_M or Q6 if you have headroom and want margin; pick Q8 when VRAM is abundant and accuracy matters (RAG, agentic use). Use FP16 only for training, fine-tuning, or quality comparisons.

Will a 12 GB GPU run a 13B model?

Yes. At Q4_K_M a 13B model needs ~8 GB of VRAM for weights plus ~2-3 GB for an 8k context window, so a 12 GB card like the RTX 4070 or RTX 3060 12GB runs it at 25-40 tok/s depending on the runtime, with about 1 GB of headroom. Larger contexts (32k+) push total memory use past 12 GB and force offload to system RAM; use the calculator to verify your exact combination.
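To see why context length decides the fit, here is a minimal sketch of the standard KV cache size formula (one key and one value tensor per layer). The architecture numbers in the example (40 layers, 8 KV heads, head dimension 128, FP16 cache) are hypothetical and chosen only for illustration, as is the 1 GB reserve; real models and runtimes differ, so these figures will not match the calculator's.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB for a single sequence.

    The leading 2 accounts for storing both keys and values per layer.
    bytes_per_value=2 assumes an FP16 cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


def fits(vram_gb: float, weights_gb: float, kv_gb: float,
         reserve_gb: float = 1.0) -> bool:
    """True if weights + KV cache + an assumed reserve fit in VRAM."""
    return weights_gb + kv_gb + reserve_gb <= vram_gb


if __name__ == "__main__":
    # Hypothetical 13B-class architecture, not any specific model.
    for ctx in (8_192, 32_768):
        kv = kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128,
                         context_tokens=ctx)
        print(f"ctx={ctx}: KV cache ~{kv:.2f} GB, "
              f"fits in 12 GB: {fits(12.0, 8.0, kv)}")
```

The cache grows linearly with context length, which is why a model that fits at 8k can spill out of VRAM at 32k on the same card.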

Is a Mac or a GPU better for local AI?

Apple Silicon wins on memory capacity per dollar (a 64 GB M4 Pro runs 70B-class models that would need a 48 GB RTX 6000 Ada on the NVIDIA side). NVIDIA GPUs win on raw decode speed (2-3× on the same model) and tooling maturity (CUDA/vLLM/Triton). Pick a Mac for large models on a budget; pick NVIDIA for fast 13B-30B inference and production serving.

Does the calculator work with runtimes like llama.cpp, Ollama, vLLM, and MLX?

Yes. Pick your runtime in the selector and the estimates adjust for KV cache layout, quantization support, and real-world overhead. llama.cpp and Ollama are most efficient at low batch sizes; vLLM shines for concurrent requests; MLX is native to Apple Silicon. Each runtime has its own unified-memory or VRAM overhead profile baked into the fit engine.

Will It Run AI · Calculator

Tell us what you own and what you want to do. We will rank the local models that make sense.

Start from your hardware and workload, then get a shortlist based on fit, speed, and runtime support instead of guessing from generic model lists or benchmark screenshots.

Live catalog snapshot: 196 hardware profiles, 374 models, 24 runtimes. The calculator ranks against this live catalog rather than a static benchmark list.

Now evaluating: RTX 4070 12GB

Workload: Coding

Runtime: llama.cpp

Operating mode: Balanced

Inputs

Pick the hardware, runtime, and workload you want to test.

Use the detected hardware if it is right, override it if it is not, and rerun the ranking to compare realistic local AI options.

Hardware

Custom hardware specs

Runtime · Workload · Operating mode

Balanced for general local use. Keeps the ranking neutral across personal and serving workflows.

Update the hardware or workload and recalculate to refresh the ranking.

1. Fit

Memory fit and headroom decide whether a model is realistic on the selected hardware.

2. Workload

The score rewards models that match the selected task and penalizes stale or legacy families when newer specialist releases exist.

3. Speed

Decode throughput and time to first token (TTFT) keep the shortlist practical for real usage, not just runs that are theoretically possible.
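The three steps above (fit, workload, speed) can be sketched as a filter followed by a weighted sort. Everything in this example is hypothetical: the `Candidate` fields, the weights, the normalization constants, and the model names are invented for illustration and are not the calculator's real scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    required_gb: float      # weights + KV cache at the requested context
    workload_match: float   # 0..1, how well the model suits the task
    decode_tok_s: float     # decode throughput
    ttft_ms: float          # time to first token


def score(c: Candidate, vram_gb: float) -> float:
    """Combine workload match, headroom, and speed with made-up weights."""
    headroom = 1.0 - c.required_gb / vram_gb          # fraction of VRAM left
    speed = min(c.decode_tok_s / 100.0, 1.0)          # cap at 100 tok/s
    latency = max(0.0, 1.0 - c.ttft_ms / 5000.0)      # 0 at 5 s or slower
    return 40 * c.workload_match + 30 * headroom + 20 * speed + 10 * latency


def rank(candidates: list[Candidate], vram_gb: float) -> list[Candidate]:
    """Step 1: drop models that do not fit. Steps 2-3: sort by score."""
    fitting = [c for c in candidates if c.required_gb <= vram_gb]
    return sorted(fitting, key=lambda c: score(c, vram_gb), reverse=True)


if __name__ == "__main__":
    catalog = [
        Candidate("generalist-8b", 7.0, 0.6, 80.0, 2400.0),
        Candidate("specialist-9b", 8.0, 1.0, 70.0, 2500.0),
        Candidate("big-70b", 40.0, 1.0, 5.0, 9000.0),
    ]
    for c in rank(catalog, vram_gb=12.0):
        print(f"{c.name}: {score(c, 12.0):.1f}")
```

Treating fit as a hard filter rather than a score term keeps a model that cannot load from ever outranking one that can.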

Qwen

Qwen 3.5 9B

Frontier · Released Jun 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 3.5 9B is a specialized fit for Coding. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #1

S · Runs · MEASURED

Score: 122.0

Fit status: Runs well

Fit: Runs well with 32K safe context.

Runtime support: native via GGUF on cuda-local.

Runtime: llama.cpp

Artifact: GGUF

All 374 models

Full compatibility grid for RTX 4070 12GB

244 models fit · 9 excellent · 37 great

Grade · Model · Params · Tasks · Q4 VRAM · Decode · Context · Memory · Fit

Qwen

Qwen 3 8B

Frontier · Released Apr 2025 · Hugging Face · Ollama · LM Studio

Why it wins

Qwen 3 8B is viable for Coding, but is not the most specialized choice. It is a recent-generation family, which helps on current local SOTA workloads. It fits natively with comfortable headroom. Context coverage stays within the requested workload envelope. Known distribution channels: huggingface, ollama, lm-studio.

Capacity: Roomy · Bandwidth: Medium · Stack: Standard

Interactive: Good · Light API: Great · Bottleneck: Balanced

Rank #8

S · Runs · EST.

Score: 99.6

Fit status: Runs well

Fit: Runs well with 37K safe context.

Runtime support: native via GGUF on cpu-gpu-local.

Runtime: llama.cpp

Artifact: GGUF

Quant: q4-k-m

Decode: 83.3 tok/s

Safe ctx: 37K

Official ctx: 131K

Support: native

TTFT: 2325 ms

Weights: 4.9 GB

KV cache: 2.2 GB

Backend: cpu-gpu-local

Current limits

This setup is broadly balanced for this model.

No major red flags

This recommendation has enough memory headroom and acceptable estimated speed for the selected workload.

Best next improvements

Score 99.6 combines workload match, catalog freshness, fit safety, context coverage, artifact choice, memory utilization, throughput, and latency.

Qwen 3.5 9B

CodeGeeX 4 9B

Gemma 4 E4B

Codestral Mamba 7B

Yi Coder 9B

Granite 4.1 8B

Qwen 2.5 Coder 7B

Qwen 3 8B

Nemotron Nano 9B v2

Qwen 3.5 4B