GGUF (GPT-Generated Unified Format) is the de facto standard file format for running quantized AI models locally. Created by Georgi Gerganov for llama.cpp, it packages model weights, tokenizer, and metadata in a single file. It's supported by Ollama, LM Studio, llama.cpp, and most local AI tools.

What's the best quantization for quality?

Q6_K offers the best quality-to-size ratio — nearly indistinguishable from F16 while using only 40% of the memory. If you have the VRAM, Q8 is essentially lossless. For most users, Q5_K_M is the sweet spot of quality and efficiency.

Does Q4 quantization hurt model quality?

Q4_K_M causes moderate quality loss that's acceptable for most tasks including chat and general use. For coding and reasoning, the loss is more noticeable. If your model barely fits at Q4, consider a smaller model at Q6 instead — a higher-quality smaller model often outperforms a lower-quality larger one.

Can I convert models to GGUF myself?

Yes. The llama.cpp repository includes conversion scripts. Use 'python convert_hf_to_gguf.py' to convert HuggingFace models, then the 'llama-quantize' tool (named 'quantize' in older builds) to apply different quant levels. Most popular models already have pre-quantized GGUF versions available.

What's the difference between Q4_K_M and Q4_K_S?

The suffix indicates how aggressively the weights are compressed. K_M (medium) uses mixed precision, keeping attention layers at higher precision while compressing feed-forward layers more. K_S (small) applies more aggressive compression throughout. K_M is typically 5-10% larger but noticeably better quality.

Is Q8 quantization worth the extra VRAM?

Q8 is worth it if you have the VRAM headroom. It's essentially lossless compared to F16 while using about half the memory. The jump from Q6 to Q8 costs about 30% more VRAM for minimal quality gain, so Q6_K is often the better choice for most people.

April 20, 2026
quantization, gguf, technical, vram

Q4_K_M vs Q5_K_M vs Q8 — Which GGUF Quantization Should You Use? (2026 Guide)

Q4_K_M saves 72% VRAM with moderate quality loss. Q5_K_M is the sweet spot. Q8 is near-lossless but about 2x larger than Q4. Full comparison of VRAM and quality by model size.

Quantization is the technology that makes running large AI models on consumer hardware possible. Without it, a 70B parameter model would require roughly 140GB of VRAM, more than most data center GPUs hold, let alone a gaming PC. With Q4 quantization, that same model fits in under 40GB. At Q3, it drops to roughly 31GB, close to what the largest consumer GPUs hold.

This guide explains how quantization works, what the different GGUF quant levels mean in practice, and which one you should choose for your hardware and use case.

What Is Quantization?

Every AI model is made up of billions of numerical values called weights. These weights are the learned parameters that determine how the model responds to input. By default, each weight is stored as a 16-bit floating point number (F16 or BF16), using 2 bytes of memory per parameter.

Quantization reduces the precision of these numbers. Instead of storing each weight as a 16-bit float, you store it as an 8-bit integer, a 4-bit integer, or even a 2-bit value. The trade-off is straightforward: less precision means smaller files and lower memory requirements, but also some loss in model output quality.
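To make the trade-off concrete, here is a minimal sketch of the simplest possible scheme: one shared scale for the whole list, signed integers, round to nearest. The weights are made up for illustration, and real GGUF quant types are more elaborate than this, so treat it as a picture of the idea rather than what llama.cpp does.

```python
def quantize_roundtrip(weights, bits):
    """Quantize to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    restored = [q * scale for q in quantized]
    return quantized, restored


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    # Hypothetical weights; real tensors hold millions of values.
    weights = [0.82, -0.41, 0.05, -0.93, 0.27, 0.66, -0.12, 0.39]
    for bits in (8, 4, 2):
        _, restored = quantize_roundtrip(weights, bits)
        error = mean_abs_error(weights, restored)
        print(f"{bits}-bit: mean abs error = {error:.4f}")
```

The error grows as the bit count shrinks, which is the precision-for-size trade described above.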

A useful analogy is JPEG compression for images. A raw photograph and a JPEG of the same image look nearly identical to most viewers, but the JPEG is a fraction of the file size. The compression discards fine detail that most people never notice. Quantization works similarly — it discards numerical precision that, in many cases, has minimal effect on the model's outputs.

The reason this works at all is that neural networks are surprisingly robust to reduced precision. The important information in a model is distributed across billions of parameters, so small errors in individual weights tend to average out. This redundancy means you can aggressively compress weights without catastrophic quality loss — at least down to a point.

The challenge is that not all weights are equally sensitive to precision loss. Attention layers — the parts of the model responsible for tracking relationships between tokens — tend to be more sensitive. Feed-forward layers, which do most of the bulk computation, tolerate compression better. Modern quantization methods like K-quants exploit this by applying different precision levels to different parts of the model.

Quantization Levels Explained

F16 (16-bit floating point)

The baseline. Every weight is stored as a 16-bit float, using exactly 2 bytes per parameter. A 7B model at F16 requires approximately 14GB of VRAM. Quality is the reference point for inference: no information is lost compared to how the model was trained (assuming it was trained in F16 or BF16).

F16 is rarely practical for consumer hardware on larger models. It's primarily useful for fine-tuning, where precision matters more, or for benchmarking to establish a quality ceiling.

Q8_0 (8-bit quantization)

At ~1.06 bytes per parameter, Q8_0 is approximately 47% smaller than F16. For a 7B model, that's about 7.4GB — well within reach of a 12GB GPU.

Quality loss at Q8 is so small that most benchmarks cannot measure it. For almost all practical tasks, Q8 output is indistinguishable from F16 output. If you have the VRAM for Q8, you're getting effectively lossless quality at roughly half the memory cost.

Q6_K (6-bit K-quant)

At ~0.81 bytes per parameter, Q6_K uses about 60% less memory than F16. A 7B model fits in ~5.7GB, a 14B model in ~11.3GB.

Quality is excellent — very slight degradation from F16, typically imperceptible in real-world use. Q6_K is the recommended choice when you have comfortable VRAM headroom. It threads the needle between Q8's near-lossless quality and Q5's better memory efficiency.

Q5_K_M (5-bit K-quant, medium)

At ~0.69 bytes per parameter, Q5_K_M uses 65% less memory than F16. A 7B model fits in ~4.8GB, making it viable on 6GB GPUs with some overhead room.

Quality is good — slightly more loss than Q6_K but still strong for most tasks. Q5_K_M is the sweet spot for users who want near-Q6 quality with meaningfully better memory efficiency. It's the recommended default when your GPU has limited headroom.

Q4_K_M (4-bit K-quant, medium)

The most popular quant level. At ~0.56 bytes per parameter, Q4_K_M is 72% smaller than F16. A 7B model fits in ~3.9GB, a 70B model in ~39GB.

Quality loss is moderate and acceptable for most everyday tasks. Chat, summarization, and general assistance all hold up well at Q4. Coding and reasoning tasks show more noticeable degradation. Q4_K_M is the default choice on Ollama and most model distribution platforms for good reason — it runs almost everywhere and the quality is good enough for the vast majority of users.

Q3_K_M (3-bit K-quant, medium)

At ~0.44 bytes per parameter, Q3_K_M squeezes 78% out compared to F16. A 7B model needs only ~3.1GB, a 70B model ~30.8GB.

Quality loss is noticeable. Outputs become less coherent, instruction following degrades, and complex reasoning breaks down more frequently. Q3 is the territory of "runs but not great": useful when a model just misses fitting at Q4 and you're willing to accept degraded quality rather than switch models. Consider it a last resort before going to an entirely different model.

Q2_K (2-bit quantization)

At ~0.31 bytes per parameter, Q2_K achieves 84% size reduction. A 7B model uses only ~2.2GB.

Quality loss is significant. Models at Q2 often struggle with basic coherence on complex tasks. The output can feel noticeably "off" — repetitive, less accurate, prone to hallucination. Q2 is emergency territory: use it only if no other option exists, and expect meaningful degradation. In most cases, running a smaller model at a higher quant is the better choice.

What Is GGUF?

GGUF stands for GPT-Generated Unified Format. It's the file format that makes quantized local AI models practical.

The format was created by Georgi Gerganov, the developer behind llama.cpp, the open source inference engine that first made running large language models on consumer CPUs viable. GGUF was introduced in 2023 as a replacement for the older GGML format, which had no extensible way to store metadata, so adding new model information tended to break compatibility with existing files.

GGUF's key design principle is self-containment. A single .gguf file includes everything needed to run a model: the quantized weights, the tokenizer vocabulary, the tokenizer configuration, model architecture metadata, and hyperparameters. You download one file, and any compatible runtime can load and run it immediately — no separate config files, no tokenizer downloads, no architecture lookups.
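Because everything lives in one file, a few bytes at the start are enough to identify a GGUF file and see how much it contains. The sketch below reads only the fixed header, assuming the little-endian layout used by GGUF versions 2 and 3 (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata count). The function name and the returned dictionary keys are illustrative, not part of any library.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return version, tensor count, and metadata entry count of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        # Assumed layout (GGUF v2/v3): uint32 version, uint64 tensor count,
        # uint64 metadata key/value count, all little-endian.
        raw = f.read(20)
        if len(raw) != 20:
            raise ValueError("file too short to contain a GGUF header")
        version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return {
        "version": version,
        "tensors": tensor_count,
        "metadata_entries": kv_count,
    }
```

Calling `read_gguf_header("model-name-Q4_K_M.gguf")` on a downloaded file (the path here is a placeholder) is a quick sanity check before handing it to a runtime. The tokenizer and hyperparameters sit in the metadata entries that follow this header.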

This simplicity drove GGUF's rapid adoption. Today it's the standard format for local model distribution, supported natively by:

Ollama — the most popular local model runner
LM Studio — the leading GUI for local AI
llama.cpp — the underlying inference engine
text-generation-webui — the community's browser-based interface
Jan — an offline-first AI assistant

When you browse HuggingFace for a popular model like Llama, Qwen, or Mistral, you'll typically find a dedicated GGUF repository (often from TheBloke or the model author) containing multiple quantization variants. Each file is named with its quant level, e.g., model-name-Q4_K_M.gguf.

VRAM Requirements by Quantization Level

This table shows how much VRAM each quantization level requires for common model sizes, plus an approximate quality rating.

Quant	Bytes/Param	7B Model	14B Model	70B Model	Quality
F16	2.00	14.0 GB	28.0 GB	140 GB	Baseline
Q8_0	1.06	7.4 GB	14.8 GB	74.2 GB	Excellent
Q6_K	0.81	5.7 GB	11.3 GB	56.7 GB	Very good
Q5_K_M	0.69	4.8 GB	9.7 GB	48.3 GB	Good
Q4_K_M	0.56	3.9 GB	7.8 GB	39.2 GB	Acceptable
Q3_K_M	0.44	3.1 GB	6.2 GB	30.8 GB	Noticeable loss
Q2_K	0.31	2.2 GB	4.3 GB	21.7 GB	Significant loss

Important: Add 1–2GB for KV cache and runtime overhead. For smaller models running on less than 8GB of VRAM, this overhead is proportionally larger. A 7B model at Q4_K_M might show 3.9GB for weights but require 5–6GB total in practice.
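For a rough idea of where that overhead comes from, the KV cache can be estimated from the model's shape. This is a sketch using the usual formula (one key and one value tensor per layer, per token). The example dimensions are those of a Llama-2-7B-style model with 16-bit cache values and are an assumption here; runtimes may quantize the cache or allocate it differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Estimate KV cache size in bytes for a given context length."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Assumed shape: 32 layers, 32 KV heads, head dimension 128, 4096 tokens.
    full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, context_len=4096)
    print(f"full attention: {full / 1024**3:.2f} GiB")

    # Same shape with grouped-query attention (8 KV heads) shrinks the cache.
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096)
    print(f"8 KV heads:     {gqa / 1024**3:.2f} GiB")
```

The cache scales linearly with context length, so doubling the context doubles this number while the weights stay the same size.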

Also note that model sizes vary. A "7B model" might have anywhere from 6.7B to 8.0B parameters depending on the architecture. The table gives representative numbers — check the specific model's GGUF file sizes for exact VRAM requirements.
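The table is simple arithmetic: parameter count times bytes per parameter. A small sketch that reproduces it, using the bytes-per-parameter figures from the table above, makes it easy to plug in a model's real parameter count instead of the nominal "7B". The function name is illustrative.

```python
# Approximate bytes per parameter, taken from the table above.
BYTES_PER_PARAM = {
    "F16": 2.00,
    "Q8_0": 1.06,
    "Q6_K": 0.81,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.56,
    "Q3_K_M": 0.44,
    "Q2_K": 0.31,
}


def weights_gb(params_billions, quant):
    """Estimate weight size in GB; excludes KV cache and runtime overhead."""
    return params_billions * BYTES_PER_PARAM[quant]


if __name__ == "__main__":
    # A nominal "7B" model can be anywhere from 6.7B to 8.0B parameters.
    for params in (6.7, 7.0, 8.0):
        print(f"{params}B at Q4_K_M: {weights_gb(params, 'Q4_K_M'):.1f} GB")
```

The spread between a 6.7B and an 8.0B model at the same quant is several hundred megabytes, which can matter on a card that is already a tight fit.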

K-Quants: What Do the Suffixes Mean?

The naming convention for GGUF files can look cryptic at first. Here's what each part means.

The K in Q4_K_M stands for "k-quant", a family of quantization methods developed by the llama.cpp team that significantly improved quality at low bit depths compared to the older, simpler schemes. K-quants use block quantization: weights are grouped into small blocks, each with its own scaling factor computed from the weights in that block, and those blocks are grouped into larger super-blocks whose scales are themselves stored compactly.

The key innovation in K-quants is mixed precision: different parts of the model are quantized to different bit depths. Attention layers, which are more sensitive to precision, are kept at higher bit depth. Feed-forward network layers, which tolerate compression better, are compressed more aggressively. This means a Q4_K_M model is meaningfully better than a model where every weight is simply rounded to 4 bits.
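The benefit of per-block scaling is easy to see on a toy example. The sketch below compares one scale for a whole tensor against one scale per block of four weights. The values and the block size are made up for illustration, and this is a simplification: actual K-quants use much larger super-blocks and store the scales in compressed form.

```python
def quantize_block(block, bits=4):
    """Quantize and dequantize one block using a scale derived from its max."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in block)
    scale = max_abs / qmax if max_abs else 1.0
    return [round(w / scale) * scale for w in block]


def quantize_blockwise(weights, block_size, bits=4):
    """Apply quantize_block to consecutive chunks of block_size weights."""
    restored = []
    for start in range(0, len(weights), block_size):
        restored.extend(quantize_block(weights[start:start + block_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


if __name__ == "__main__":
    # Hypothetical tensor: four small weights followed by four large ones.
    weights = [0.02, -0.03, 0.01, 0.04, 0.9, -0.7, 0.5, -0.8]
    single_scale = quantize_block(weights)            # one scale for everything
    per_block = quantize_blockwise(weights, block_size=4)
    print(f"single scale: {mean_abs_error(weights, single_scale):.4f}")
    print(f"per-block:    {mean_abs_error(weights, per_block):.4f}")
```

With a single scale, the large weights set the step size and the small ones all collapse to zero. Per-block scales let the small weights keep their own resolution, at the cost of storing one extra scale per block.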

The final suffix indicates the compression aggressiveness:

K_S (small): More aggressive compression across the board. Smallest file size, lowest quality within the bit depth.
K_M (medium): Balanced approach. The most commonly recommended variant. Usually the best quality-to-size ratio.
K_L (large): Less aggressive compression. Larger file, better quality — but the gap from K_M is smaller than the gap from K_M to K_S.

In practice, K_M is almost always the right choice. The size difference between K_S and K_M is modest (typically 5–10%), while the quality difference is noticeable. K_L files exist mainly for users who are splitting hairs at the top end of quality and have the VRAM to spare.

When you see a quant like Q4_0 (without the K), that's an older naive quantization method — same bit depth but without the mixed-precision improvements. Always prefer K-quants when available.

Which Quantization Should You Use?

The right choice depends primarily on how much VRAM you have relative to the model's requirements.

You have plenty of VRAM (model uses less than 60% of your VRAM)

Use Q6_K or Q8_0. You have the headroom, so use it for quality. Q6_K is the better default here — the quality gain from Q6 to Q8 is marginal, and Q6 leaves more VRAM available for context (KV cache) and other processes.

Comfortable fit (model uses 60–80% of your VRAM)

Use Q5_K_M. This is often the sweet spot: noticeably better than Q4, meaningfully more efficient than Q6. If your model fits comfortably at Q5_K_M, this is probably your ideal choice.

Tight fit (model uses 80–95% of your VRAM)

Use Q4_K_M. This is the most widely tested and recommended quant for constrained hardware. Quality is acceptable for most use cases, and it's well-supported across all runtimes.

Barely fits (model uses more than 95% of your VRAM)

Consider Q3_K_M, but also reconsider your model choice. A model that barely fits at its smallest practical quant may run slowly, leave no room for meaningful context, and produce degraded output. At this point, running a smaller model at Q5 or Q6 is often the better experience. A sharp 8B model at Q6 will often outperform a mediocre 14B model at Q3.

Doesn't fit at all

You have two options: offload layers to CPU (slow, potentially very slow depending on how many layers don't fit) or switch to a smaller model. Offloading works, but if more than 20–30% of layers are on CPU, generation speed drops dramatically. The hardware calculator can help you find models that actually fit your hardware.
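To judge whether offloading is worth trying, you can estimate how many layers would fit on the GPU. This is a rough sketch that assumes all layers are the same size and reserves a fixed overhead. Both are simplifications, and the 40-layer count in the example is hypothetical. In llama.cpp the resulting number is what you would pass to the `--n-gpu-layers` option.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb, overhead_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = weights_gb / n_layers
    usable_gb = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical case: 14B model at Q4_K_M (about 7.8 GB), 40 layers, 6 GB card.
    n_layers = 40
    on_gpu = gpu_layers_that_fit(weights_gb=7.8, n_layers=n_layers, vram_gb=6.0)
    cpu_share = (n_layers - on_gpu) / n_layers
    print(f"{on_gpu} of {n_layers} layers on GPU, {cpu_share:.0%} on CPU")
```

In this example well over 30% of the layers land on the CPU, which by the rule of thumb above points toward a smaller model rather than offloading.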

The key rule: a smaller model at a higher quant often beats a larger model at a lower quant. A 7B model at Q6_K will frequently produce better results than a 13B model at Q2_K, and it'll run faster too.
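The tiers above can be turned into a small decision helper. This sketch uses one possible reading of them: pick Q6_K if its weights would use at most 60% of VRAM, otherwise Q5_K_M at up to 80%, otherwise Q4_K_M at up to 95%, otherwise Q3_K_M if it fits at all. The bytes-per-parameter values come from the table earlier, and the function name is illustrative.

```python
# (quant, bytes per parameter, maximum share of VRAM the weights may use)
TIERS = [
    ("Q6_K", 0.81, 0.60),
    ("Q5_K_M", 0.69, 0.80),
    ("Q4_K_M", 0.56, 0.95),
    ("Q3_K_M", 0.44, 1.00),
]


def recommend_quant(params_billions, vram_gb):
    """Return the first quant whose weights stay within its VRAM share, else None."""
    for quant, bytes_per_param, max_share in TIERS:
        weights_gb = params_billions * bytes_per_param
        if weights_gb / vram_gb <= max_share:
            return quant
    return None  # does not fit: offload layers or pick a smaller model


if __name__ == "__main__":
    for params, vram in [(7, 12), (7, 8), (7, 6), (70, 24)]:
        print(f"{params}B on {vram} GB: {recommend_quant(params, vram)}")
```

The thresholds only account for weights. Long contexts need extra room for the KV cache, so it is reasonable to step down one level if you plan to use them.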

How Quantization Affects Different Tasks

Not all tasks are equally sensitive to quantization. Knowing where quality loss matters helps you choose the right quant for your specific workload.

Coding

Code generation is the most sensitive task. Logic is precise — a single wrong token can break a function. Attention to variable names, bracket matching, and algorithm structure all require the kind of nuanced pattern matching that degrades at lower quants. For coding models like Qwen 3 8B or dedicated code models, aim for Q5_K_M or higher. Q4 is workable for simple completions but shows cracks on complex multi-file tasks.

Chat and conversation

Conversational use is the least sensitive. Human language is inherently redundant and tolerant of small variations. A slightly imprecise word choice or a minor coherence lapse is forgiven in conversation in a way it isn't in code. Q4_K_M is perfectly adequate for chat. Most users won't notice the difference between Q4 and Q6 in casual conversation.

Reasoning and mathematics

Chain-of-thought reasoning — the kind used by models like DeepSeek R1 — requires maintaining precision across long sequences. A small error early in a reasoning chain can cascade into a wrong answer. Math problems are similarly unforgiving. For reasoning-heavy workloads, prefer Q5_K_M or Q6_K. The quality difference at Q4 is measurable on benchmarks like MATH and GSM8K.

Creative writing

Moderate sensitivity. The nuance that makes creative writing feel alive — word choice, rhythm, subtle characterization — can degrade at lower quants. Q4_K_M is acceptable for most creative tasks. If you're doing serious creative work where output quality matters, Q5 or Q6 is worth the extra VRAM.

Summarization and extraction

Among the least sensitive tasks. Taking a long document and condensing it, or extracting structured information, relies on broad pattern recognition rather than fine precision. Q4_K_M handles summarization well. Q3 is acceptable in a pinch.

MoE Models and Quantization

Mixture-of-Experts (MoE) models introduce a wrinkle in the standard quantization calculus. Models like Qwen 3 30B-A3B have 30 billion total parameters but only ~3 billion active parameters per forward pass. The router mechanism selects which "expert" sub-networks handle each token.

For memory planning, what matters is the total parameter count — all experts must be loaded into VRAM, even if only a fraction are active at any given moment. So a 30B MoE model still requires the memory budget of a ~30B model, not a 3B model.
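The two parameter counts answer different questions, which a short calculation makes explicit. In this sketch the total count drives the memory budget, while the active count gives a rough sense of how much weight data is touched per token. The 0.56 bytes-per-parameter figure is the Q4_K_M value from the table, and the function name and dictionary keys are illustrative.

```python
def moe_budget(total_params_b, active_params_b, bytes_per_param):
    """Split a MoE model's footprint into what is loaded and what is used per token."""
    return {
        # Every expert must be resident, so memory follows the total count.
        "weights_gb": total_params_b * bytes_per_param,
        # Only the routed experts are read for a given token (rough estimate).
        "active_gb_per_token": active_params_b * bytes_per_param,
    }


if __name__ == "__main__":
    budget = moe_budget(total_params_b=30, active_params_b=3, bytes_per_param=0.56)
    print(f"weights to load:      {budget['weights_gb']:.1f} GB")
    print(f"weights used / token: {budget['active_gb_per_token']:.2f} GB")
```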

However, the quality of a MoE model is primarily determined by its active parameters. Because the model's capacity is distributed across many specialized experts, each expert doesn't need to be individually as precise as a dense model's layers. This means MoE models tolerate aggressive quantization somewhat better than equivalent dense models.

In practice, this means you can often run MoE models at Q4_K_M and get quality that feels more like a larger dense model. The Qwen 3 30B-A3B at Q4, for instance, performs like a quality 8–10B dense model in terms of response quality. It still needs the VRAM of a 30B model, but because only about 3B parameters are used per token, it generates text at closer to the speed of a much smaller dense model.

The caveat is that very aggressive quantization (Q2, Q3) on MoE models can degrade routing quality, which affects which experts get selected — a different failure mode than in dense models. Stay at Q4_K_M or above for MoE models in production use.

Converting Models to GGUF

Most popular models already have pre-quantized GGUF versions on HuggingFace, so conversion is rarely necessary. But if you need to quantize a model yourself — perhaps a new release before the community has packaged it, or a fine-tune you trained yourself — the process is straightforward.

The llama.cpp repository provides all the necessary tools:

# Step 1: Convert a HuggingFace model to GGUF (F16)
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Step 2: Quantize to your target level
./llama-quantize model-f16.gguf model-Q5_K_M.gguf Q5_K_M

The llama-quantize binary (named quantize in older llama.cpp builds) accepts any of the standard quant level names (Q4_K_M, Q5_K_S, Q6_K, Q8_0, etc.) as the final argument. You can generate multiple quant levels from a single F16 GGUF without re-running the conversion step.
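If you want several quant levels from one F16 file, a small script can generate the commands for you. The sketch below only builds the argument lists and does not run anything. The binary path is an assumption, since the tool's name and location depend on your llama.cpp version and build (older builds call it `quantize`, newer ones `llama-quantize`), and the helper name is made up.

```python
from pathlib import Path


def quantize_commands(f16_path, quant_levels, binary="./llama-quantize"):
    """Build one quantize command per quant level from a single F16 GGUF."""
    src = Path(f16_path)
    # Derive "model" from "model-f16.gguf" so outputs are named model-<QUANT>.gguf.
    stem = src.name.removesuffix(".gguf").removesuffix("-f16")
    commands = []
    for level in quant_levels:
        out = src.with_name(f"{stem}-{level}.gguf")
        commands.append([binary, str(src), str(out), level])
    return commands


if __name__ == "__main__":
    for cmd in quantize_commands("model-f16.gguf", ["Q4_K_M", "Q5_K_M", "Q6_K"]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you have confirmed the binary path on your system.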

For models already on HuggingFace, search for the model name plus "GGUF" — most popular models have dedicated GGUF repositories. HuggingFace also supports direct GGUF downloads via the huggingface-cli:

huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q5_K_M.gguf

Practical Recommendations by Hardware

To make this concrete, here's what to use on common hardware configurations:

4GB VRAM (GTX 1650, RX 6500 XT): Q4_K_M on 3B models only. For anything larger, you'll need CPU offloading or cloud inference.

6GB VRAM (RTX 2060, GTX 1660 Ti): Q4_K_M on 7B models (tight fit). Q5_K_M on 3B models. No room for 14B+ models without heavy offloading.

8GB VRAM (RTX 3070, RX 6600 XT): Q5_K_M on 7B models. Q4_K_M on 8B models. Small 14B models at Q3 (acceptable quality trade-off).

12GB VRAM (RTX 3080 12GB, RTX 4070): Q6_K on 7B models. Q5_K_M on 14B models. Q4_K_M on 14B with room to spare.

16GB VRAM (RTX 4080, RX 7900 GRE): Q8_0 on 7B models. Q6_K on 14B models. 30B MoE models at Q4_K_M need about 16.8GB for weights alone, so expect to offload a few layers to CPU.

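24GB VRAM (RTX 3090, RTX 4090): Q8_0 on 14B models. Q5_K_M on 30B models (tight), or Q4_K_M with room for context. 70B models do not fit: Q4_K_M needs about 39GB for weights alone, so they require heavy CPU offloading or a second GPU.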

Apple Silicon (unified memory): The CPU and GPU share one memory pool, so there is no separate VRAM and no PCIe bottleneck between CPU memory and GPU compute. The catch is that macOS and your other apps draw from the same pool, so the full 16GB of an M2 Pro is never available to the model. Budget for a 14B model at Q4_K_M or Q5_K_M rather than Q6_K, whose weights alone take about 11.3GB.

For a personalized recommendation based on your exact hardware, use the hardware calculator — it accounts for your specific GPU's VRAM, bandwidth, and architecture when recommending which models and quant levels will run well.

Summary

Quantization makes local AI practical by trading precision for memory efficiency. The GGUF format packages quantized models in a single portable file that any modern local inference runtime can load.

The right quant level is primarily a function of VRAM budget:

Q8_0 / Q6_K: Use when you have the VRAM. Near-lossless quality.
Q5_K_M: The sweet spot. Recommended default for constrained hardware.
Q4_K_M: The mainstream choice. Good for most tasks, acceptable quality.
Q3_K_M: Last resort. Noticeable degradation.
Q2_K: Emergency only. Significant quality loss.

When in doubt, the guiding principle is: it's better to run a smaller model well than a larger model badly. A 7B model at Q6 will often outperform a 13B model at Q2, runs faster, and leaves more room for context.
