Quantization Explained: Q4 vs Q8 vs FP16 — What You Actually Lose
Plain-English explanation of AI model quantization. What Q4, Q8, and FP16 mean, how much quality you lose at each level, and which to use for your hardware.
You've downloaded a model and there's a list of files: model-Q4_K_M.gguf, model-Q6_K.gguf, model-Q8_0.gguf, model-F16.gguf. They're all the same model. Why are there four versions? Which one should you use? And what exactly happens to the model when you pick a lower number?
This guide explains quantization from first principles — what it is, what each level actually does, and what you give up when you choose Q4 over Q8 over FP16.
What a Model Is Made Of
An AI model is a large collection of numbers. A 7B model has roughly 7 billion of them. These numbers are called weights or parameters — they encode everything the model has learned during training: language patterns, factual knowledge, reasoning strategies, writing styles.
At training time, these weights are stored as 32-bit or 16-bit floating-point numbers. Each 16-bit float takes 2 bytes of storage. So a 7 billion parameter model at FP16 needs:
7,000,000,000 × 2 bytes = 14 GB
That's just the weights. Add 1–3 GB for the KV cache (the working memory used during inference) and you need 15–17 GB of VRAM for a 7B model at full precision. Most consumer GPUs have 8–16 GB. A 70B model at FP16 needs 140 GB — more than any single consumer GPU holds.
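To make that arithmetic concrete, here is a rough sketch of where the 15–17 GB figure comes from. The KV-cache formula (one key tensor and one value tensor per layer, one vector per token per KV head) is the usual one for transformer decoders. The architecture numbers (32 layers, 32 KV heads, head dimension 128, a 4,096-token context) are assumptions chosen to resemble a typical 7B model, not values read from any particular file.

```python
def weight_bytes(n_params, bytes_per_param):
    """Memory taken by the weights alone."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Memory taken by the KV cache at a given context length.

    The leading 2 accounts for storing both a key and a value tensor per layer.
    bytes_per_value=2 assumes the cache itself is kept in FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GB = 1e9

# Assumed, illustrative 7B-class architecture (not taken from a specific model file).
weights = weight_bytes(7e9, 2.0)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)

print(f"weights:  {weights / GB:.1f} GB")
print(f"KV cache: {kv / GB:.1f} GB")
print(f"total:    {(weights + kv) / GB:.1f} GB")
```

The KV cache grows linearly with context length, so doubling the context in this sketch doubles the cache while the weights stay fixed.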
Quantization solves this.
What Quantization Does
Quantization replaces each high-precision weight (a 32-bit or 16-bit float) with a lower-precision integer approximation.
Instead of storing a weight as a 16-bit float such as 0.1235, quantization might store it as 3 on a scale of 0–15 (a 4-bit integer). A scale factor translates that 3 back into an approximate float, and because one scale factor is shared across a whole block of weights, its storage cost is amortized.
The result: you can represent 4 weights in 2 bytes (Q4) instead of 8 bytes (FP16), which is 75% less memory before counting the shared scale factors. You've also introduced small approximation errors, because the stored value isn't quite the original. Whether those errors matter depends on how sensitive the model is and what task you're running.
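The idea can be sketched in a few lines. This is a deliberately simplified block quantizer (one minimum and one scale per block, codes in 0–15), not the exact scheme used by any GGUF quantization type; the function names and the sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map each float in a block to an integer code in 0..(2**bits - 1)."""
    levels = (1 << bits) - 1                  # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                   # scale and lo are shared by the block


def dequantize_block(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical block of weights.
block = [0.1234, -0.5021, 0.3310, 0.0042, -0.2177, 0.4895, -0.0813, 0.2650]

codes, scale, lo = quantize_block(block)
restored = dequantize_block(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(block, restored))

print("codes:   ", codes)
print("restored:", [round(x, 4) for x in restored])
print(f"max error {max_err:.4f}, at most half a step ({scale / 2:.4f})")
```

Each restored weight lands within half a quantization step of the original. Fewer bits mean fewer steps across the same range, so each step, and the worst-case error, gets larger.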
The key insight: Neural networks are surprisingly robust to small errors in individual weights. The useful information is distributed across billions of parameters, so small precision losses tend to cancel out. This is what makes quantization viable at all.
FP16: The Baseline
FP16 (16-bit floating point, also written as "half precision") is the standard precision for AI model inference and fine-tuning. Each weight uses exactly 2 bytes.
Memory: 2 bytes per parameter. 7B model = ~14 GB, 70B model = ~140 GB.
Quality: The ceiling. Everything else is measured against FP16.
When to use it: Fine-tuning (where precision matters for gradient computation), professional serving environments, or benchmarking. For local inference on a single consumer GPU, FP16 is usually impractical — it needs too much VRAM.
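You'll also see BF16 (bfloat16). It has the same memory footprint as FP16 but spends more of its 16 bits on the exponent and fewer on the mantissa, giving it a wider numerical range at slightly lower precision. For inference quality, they're effectively equivalent.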
Q8: Near-Lossless Compression
Q8_0 stores each weight as an 8-bit integer (1 byte, effectively) plus block scaling factors. The total is about 1.06 bytes per parameter — roughly a 47% reduction from FP16.
Memory: ~1.06 bytes per parameter. 7B model = ~7.5 GB, 70B model = ~74 GB.
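The 1.06 figure is what you get if weights are grouped into blocks of 32, each block holding 32 one-byte integers plus a single 2-byte FP16 scale. That block size and scale width are assumptions based on how Q8_0 is commonly described; treat the format's own source as the authority. A small sketch of the overhead calculation:

```python
def bytes_per_weight(block_size, bits_per_weight, scale_bytes):
    """Average storage per weight, including the block's shared scale."""
    payload = block_size * bits_per_weight / 8   # bytes of integer codes
    return (payload + scale_bytes) / block_size


# Assumed Q8_0-style layout: 32 weights per block, 8 bits each, one FP16 scale.
q8 = bytes_per_weight(block_size=32, bits_per_weight=8, scale_bytes=2)

print(f"{q8:.4f} bytes per weight")
print(f"{(1 - q8 / 2.0) * 100:.0f}% smaller than FP16")
```

The same function shows why larger blocks reduce overhead: the scale's 2 bytes are spread over more weights, at the cost of one scale having to cover a wider range of values.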
Quality: Essentially lossless. The difference between Q8 and FP16 output is statistically indistinguishable in most benchmarks. Perplexity (a measure of how well a model predicts text, where lower is better) rises by less than 1% at Q8.
What you lose: Practically nothing for typical inference tasks. If you run a Q8 and an FP16 version of the same model side by side and compare 1,000 outputs, you'll struggle to reliably identify which is which.
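For readers who want to see what a perplexity comparison involves: perplexity is the exponential of the average negative log-probability the model assigns to each token of a test text. The sketch below assumes you already have per-token log-probabilities from both versions of a model; the numbers used here are invented purely to show the calculation.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(ppl_quantized, ppl_baseline):
    """Percentage by which the quantized model's perplexity exceeds the baseline."""
    return (ppl_quantized - ppl_baseline) / ppl_baseline * 100


# Invented log-probabilities for the same five tokens under two model versions.
fp16_logprobs = [-1.20, -0.35, -2.10, -0.80, -1.55]
q8_logprobs = [-1.21, -0.35, -2.11, -0.80, -1.56]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_q8 = perplexity(q8_logprobs)

print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"Q8 perplexity:   {ppl_q8:.3f}")
print(f"increase:        {relative_increase(ppl_q8, ppl_fp16):.2f}%")
```

A real measurement would average over many thousands of tokens; five tokens are far too few to say anything about a model.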
When to use it: Whenever your VRAM comfortably fits Q8. It gives you FP16-quality output at half the memory cost. The only reason to use FP16 over Q8 for inference is fine-tuning or very precise numerical work.
Practical note: A 7B model at Q8 fits in 12 GB VRAM. An RTX 4070 12 GB, RTX 3060 12 GB, or Apple M4 Mac mini 16 GB can run 7B at Q8 with room for context.
Q6: High Quality, Better Efficiency
Q6_K uses 6-bit k-quantization — 0.81 bytes per parameter, about 60% less than FP16.
Memory: ~0.81 bytes per parameter. 7B model = ~5.7 GB, 70B model = ~57 GB.
Quality: Excellent. Very slight degradation from FP16, imperceptible in most real-world tasks. Perplexity typically increases by 1–3% versus FP16. For chat, writing, coding, and reasoning, Q6 outputs are nearly identical to FP16 outputs.
What you lose: Negligible for everyday use. Expert evaluators in controlled studies can sometimes detect Q6 versus FP16 on long-form creative or analytical tasks. In practice, users almost never notice.
When to use it: The recommended default when your VRAM allows the model to fit at Q6 with comfortable headroom. It threads the needle between Q8's near-lossless quality and Q5's better efficiency.
Key difference from Q8: Q6 needs roughly a quarter less VRAM than Q8, in exchange for a perplexity increase of 1–3% versus FP16 instead of under 1%. For most users, Q6 is a better trade-off than Q8: you get nearly the same quality at meaningfully lower memory cost.
Q5: The Sweet Spot
Q5_K_M (5-bit k-quant, medium) uses 0.69 bytes per parameter — 65% less than FP16.
Memory: ~0.69 bytes per parameter. 7B model = ~4.8 GB, 70B model = ~48 GB.
Quality: Good. Perplexity increases 3–6% versus FP16. Outputs are slightly less precise than Q6 but still strong for most tasks. The quality difference is occasionally noticeable in long, complex reasoning chains — but most users won't reliably detect it in typical use.
What you lose: You start to see occasional divergences in edge cases: slightly less precise numerical reasoning, occasional less-optimal word choices in nuanced writing, marginally weaker performance on coding edge cases. None of this is dramatic.
When to use it: When your VRAM fits the model comfortably at Q5 but not Q6. Also a good choice when you want slightly better efficiency than Q6 without dropping to Q4 quality. Many users run Q5 as their standard level and can't tell the difference from Q6 in daily use.
Q4: The Mainstream Choice
Q4_K_M (4-bit k-quant, medium) is the most widely used quantization level. It uses 0.56 bytes per parameter — 72% less than FP16.
Memory: ~0.56 bytes per parameter. 7B model = ~3.9 GB, 14B model = ~7.8 GB, 70B model = ~39 GB.
Quality: Acceptable. Perplexity increases 5–12% versus FP16. For chat, casual writing, and summarization, most users find Q4 quality adequate. For coding, structured reasoning, and math, the degradation is more noticeable.
What you lose:
- Coding: Logical errors appear more frequently on complex functions. Variable naming becomes slightly less consistent. Multi-step algorithms sometimes contain subtle bugs that Q6+ avoids. For simple scripts and boilerplate, Q4 works well.
- Reasoning: Chain-of-thought accuracy drops. Models at Q4 are more likely to make arithmetic errors or lose track of complex logical dependencies over long reasoning chains.
- Nuanced writing: The prose is still fluent, but precision of word choice and stylistic consistency degrade slightly. Most users don't notice.
- Factual recall: Marginal increase in hallucination rate. The model's compressed weights slightly blur the edges of its factual knowledge.
When to use it: When your VRAM requires it. Q4_K_M is the right choice when Q5 doesn't fit comfortably. It's the standard level on most model distribution platforms for good reason — it runs on the widest range of hardware with acceptable quality.
When NOT to use it: If you're doing serious coding work, complex reasoning tasks, or running a model that's already borderline on quality (like a very small 3B model), the quality loss at Q4 compounds. Use Q5 or Q6 instead.
Q3 and Q2: Last Resort Territory
Q3_K_M
Uses 0.44 bytes per parameter. 7B model = ~3.1 GB.
Quality loss is noticeable. Outputs become less coherent on complex tasks, instruction following degrades, and the model "sounds right but means less." Think of it as a model that's clearly a step down — you can still get useful outputs, but you'll notice.
Use only when: No other option. If a model just misses fitting at Q4, try Q3 before giving up, but also seriously consider running a smaller model at Q4 or Q5 instead.
Q2_K
Uses 0.31 bytes per parameter. 7B model = ~2.2 GB.
Significant quality loss. The model often loses coherence on anything complex. Responses can be repetitive, factually imprecise, or logically broken. In almost all cases, a smaller model at Q4 will outperform a larger model at Q2.
Use only when: There is no alternative. In practice, this almost never makes sense.
The Key Rule
A smaller model at a higher-precision quantization level will usually outperform a larger model that has been compressed to Q3 or Q2.
A 7B model at Q6 typically outperforms a 13B model at Q2. A 14B model at Q5 typically outperforms a 30B model at Q3. Model size and quantization level both shape output quality, and once compression gets extreme, the damage it does outweighs the benefit of the extra parameters.
Before dropping to Q3 to fit a model, ask: would a smaller model at Q5 give better results? The answer is usually yes.
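One way to put that question into practice is to list every model-and-level combination that fits your memory budget, so the smaller-model alternatives are visible next to the heavily compressed large ones. The sketch below uses this guide's bytes-per-parameter figures; the 1.5 GB reserved for the KV cache and the 8 GB budget are assumptions for illustration. It only reports what fits, not which option is better.

```python
# Bytes per parameter, as quoted in this guide.
BYTES_PER_PARAM = {
    "Q8_0": 1.06, "Q6_K": 0.81, "Q5_K_M": 0.69,
    "Q4_K_M": 0.56, "Q3_K_M": 0.44, "Q2_K": 0.31,
}


def options_that_fit(sizes_billions, budget_gb, reserve_gb=1.5):
    """Return (size, level, weight_gb) for every combination within budget.

    reserve_gb is an assumed allowance for the KV cache and runtime overhead.
    """
    usable = budget_gb - reserve_gb
    fits = []
    for size in sizes_billions:
        for level, bpp in BYTES_PER_PARAM.items():
            weight_gb = size * bpp   # billions of params x bytes/param = GB
            if weight_gb <= usable:
                fits.append((size, level, round(weight_gb, 1)))
    return fits


for size, level, gb in options_that_fit([7, 13], budget_gb=8):
    print(f"{size}B at {level}: {gb} GB")
```

With these assumptions, the only 13B options that fit are Q3_K_M and Q2_K, while the 7B model fits at Q6_K in about the same space as the 13B at Q3_K_M.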
Quantization Level Quick Reference
| Level | Bytes/Param | 7B VRAM | 70B VRAM | Quality vs FP16 |
|---|---|---|---|---|
| FP16 | 2.00 | 14 GB | 140 GB | Baseline |
| Q8_0 | 1.06 | 7.5 GB | 74 GB | ~99% (near identical) |
| Q6_K | 0.81 | 5.7 GB | 57 GB | ~97% (excellent) |
| Q5_K_M | 0.69 | 4.8 GB | 48 GB | ~95% (good) |
| Q4_K_M | 0.56 | 3.9 GB | 39 GB | ~90% (acceptable) |
| Q3_K_M | 0.44 | 3.1 GB | 31 GB | ~80% (noticeable loss) |
| Q2_K | 0.31 | 2.2 GB | 22 GB | ~70% (significant loss) |
Quality percentages are approximate and task-dependent. Chat tasks are more forgiving; coding and reasoning are less so.
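The memory columns follow directly from the bytes-per-parameter column, so you can extend the table to other model sizes yourself. A short sketch, using the table's own figures (weights only, with no KV cache included):

```python
# (level, bytes per parameter) as listed in the table above.
LEVELS = [
    ("FP16", 2.00), ("Q8_0", 1.06), ("Q6_K", 0.81), ("Q5_K_M", 0.69),
    ("Q4_K_M", 0.56), ("Q3_K_M", 0.44), ("Q2_K", 0.31),
]


def vram_gb(params_billions, bytes_per_param):
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_param


def saving_vs_fp16(bytes_per_param):
    """Percentage reduction relative to FP16's 2 bytes per parameter."""
    return (1 - bytes_per_param / 2.00) * 100


for name, bpp in LEVELS:
    print(
        f"{name:<7} 7B: {vram_gb(7, bpp):5.1f} GB   "
        f"70B: {vram_gb(70, bpp):6.1f} GB   "
        f"{saving_vs_fp16(bpp):3.0f}% smaller than FP16"
    )
```

The printed values agree with the table to within rounding. Swap in 14 or 32 for the parameter count to estimate other model sizes.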
How to Choose
You have plenty of VRAM (model uses <60% of your VRAM): Use Q6_K or Q8_0.
Comfortable fit (model uses 60–80% of VRAM): Use Q5_K_M.
Tight fit (model uses 80–95% of VRAM): Use Q4_K_M.
Barely fits at Q4 (model uses >95% of VRAM): Either use Q3_K_M (accept degradation) or switch to a smaller model at Q5–Q6.
Doesn't fit at all: CPU offload at Q4, or choose a smaller model.
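The rules above can be written as a small lookup. This is only a sketch of the guide's thresholds: the function name is made up, and how to treat the exact boundary values (60%, 80%, 95%) is an assumption, since the guide does not say which side they fall on.

```python
def recommend_level(vram_fraction):
    """Map the share of VRAM a model would use to this guide's recommendation.

    vram_fraction is model size divided by available VRAM (0.7 means 70%).
    Boundary handling is an assumption; the guide leaves it unspecified.
    """
    if vram_fraction < 0.60:
        return "Q6_K or Q8_0"
    if vram_fraction <= 0.80:
        return "Q5_K_M"
    if vram_fraction <= 0.95:
        return "Q4_K_M"
    if vram_fraction <= 1.00:
        return "Q3_K_M, or a smaller model at Q5-Q6"
    return "CPU offload at Q4, or a smaller model"


for fraction in (0.45, 0.70, 0.90, 0.98, 1.20):
    print(f"{fraction:.0%} of VRAM -> {recommend_level(fraction)}")
```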
Not sure what fits your hardware? The hardware calculator tells you which models and quantization levels work for your specific GPU or Mac.