Will It Run AI
mistral, vram, gpu-requirements, coding

Mistral VRAM Requirements (2026) — Will Your GPU Run 7B, Nemo 12B, Codestral 22B or Small 24B?

Mistral 7B needs ~4.3GB at Q4. Nemo 12B ~7.1GB. Codestral 22B ~12.8GB. Small 24B ~13.4GB. Full Q4/Q5/Q6/Q8 VRAM tables with 8GB/12GB/24GB GPU picks and Mac fit.

If you are searching for Mistral 7B VRAM requirements or Mistral Small 24B VRAM requirements, this guide gives exact numbers by quantization level, plus the GPU and Mac hardware that fits.

If you actually mean the newer 119B model, use the dedicated landing page here: Mistral Small 4 VRAM Requirements.

Mistral AI has built one of the most efficient open-weight model families available. From the scrappy 7B that punches well above its class to purpose-built coding specialists like Codestral and Devstral, there's a Mistral model for almost any hardware setup. But knowing which one fits your GPU - and at what quantization level - takes a bit of research.

This guide covers the full Mistral model lineup with exact VRAM requirements, hardware recommendations, and quick-start commands.

Fast reference

  • Mistral 7B: ~4.0 GB at Q4_K_M
  • Mistral Nemo 12B: ~6.8 GB at Q4_K_M
  • Codestral 22B: ~12.3 GB at Q4_K_M
  • Mistral Small 24B: ~13.4 GB at Q4_K_M
  • Devstral Small 24B: ~13.4 GB at Q4_K_M

The Mistral Model Family

Mistral AI has taken a focused approach: fewer, better models rather than a sprawling lineup. Their open-weight releases cluster around a few key sizes:

  • Mistral 7B — the original, still one of the best 7B models available
  • Mistral Nemo 12B — a mid-range option co-developed with NVIDIA
  • Mistral Small 24B — instruction following and general assistant tasks
  • Codestral 22B — code generation, completion, and explanation
  • Devstral Small 24B — agentic coding workflows
  • Mistral Large — the closed/commercial frontier model (discussed below)

The most important thing to understand about Mistral models is their architecture efficiency. Mistral 7B uses sliding window attention and grouped query attention, which means it delivers better throughput and handles longer contexts more efficiently than comparable models at the same parameter count.

VRAM Requirements by Model

Here's what each model needs at different quantization levels:

ModelParamsQ4_K_MQ5_K_MQ6_KQ8_0F16
Mistral 7B7.2B4.0 GB5.0 GB5.9 GB7.6 GB14.4 GB
Mistral Nemo 12B12.2B6.8 GB8.4 GB9.9 GB12.9 GB24.4 GB
Codestral 22B22.2B12.3 GB15.2 GB17.8 GB23.3 GB44.4 GB
Mistral Small 24B24.0B13.4 GB16.5 GB19.4 GB25.4 GB48.0 GB
Devstral Small 24B24.0B13.4 GB16.5 GB19.4 GB25.4 GB48.0 GB

Note: These are model weight sizes. Add ~1-2 GB for KV cache and runtime overhead at default context lengths. Codestral supports a 32K context window, which increases KV cache requirements significantly at longer contexts.

Mistral Large is not included — it's a closed API model that Mistral doesn't release as open weights for local inference. If you need Mistral Large-class quality locally, Mistral Small 24B at Q6+ gets surprisingly close.

Hardware Recommendations by Model

Mistral 7B — The Efficient Workhorse

Mistral 7B was a watershed release. When it launched in late 2023 it outperformed Llama 2 13B on most benchmarks at half the parameter count. That efficiency advantage still holds today.

At Q4_K_M it needs just 4 GB of VRAM, making it one of the only models that fits comfortably in dedicated GPU memory on integrated graphics and entry-level cards.

Recommended hardware:

  • RTX 4060 8GB — fits at Q8 with room to spare
  • RTX 4070 12GB — comfortable at Q8, fast throughput
  • Any Mac with 8GB+ unified memory

Quick start:

ollama run mistral:7b

Check compatibility: Mistral 7B on RTX 4060 | Mistral 7B on RTX 4070

When to choose Mistral 7B: you have a budget GPU or you need maximum tokens per second for high-volume tasks. It's a workhorse, not a showpiece.

Mistral Nemo 12B — The Underrated Mid-Ranger

Mistral Nemo 12B was a collaboration between Mistral and NVIDIA, and it shows. The model uses a 128K context window — far beyond what most models its size offer — and delivers strong multilingual performance.

At Q4_K_M it needs 6.8 GB, which fits on any 8GB GPU with room for context. At Q6_K it needs ~10 GB, making the 12GB RTX 4070 a natural match.

Recommended hardware:

Quick start:

ollama run mistral-nemo:12b

The 128K context window is a genuine differentiator. If you're processing long documents, chat histories, or large codebases, Nemo handles it at a hardware cost you can afford.

Codestral 22B — The Code Specialist

Codestral 22B is Mistral's dedicated code model. Trained specifically on code data, it supports 80+ programming languages and is optimized for fill-in-the-middle (FIM) completion — the technique that powers inline code suggestions in editors like VS Code and Neovim.

At Q4_K_M it needs 12.3 GB. That's a tight fit for 12GB cards but a comfortable one for the 16GB RTX 4070 Ti Super.

Recommended hardware:

Quick start:

ollama run codestral:22b

Check compatibility: Codestral 22B on RTX 4070 Ti Super | Codestral 22B on RTX 4090

Codestral's FIM capability means it works with Continue, Cursor, and similar coding assistants that support local models. If you're setting up a local AI coding workflow, this is the model to start with.

Mistral Small 24B — Best-in-Class Instruction Following

Mistral Small 24B is Mistral's flagship open-weight general assistant. It's optimized for instruction following, function calling, and multi-turn conversation. On several instruction-following benchmarks it outperforms models significantly larger than it.

At Q4_K_M it needs 13.4 GB, which puts it just over the limit for 12GB cards but squarely within reach for 16GB GPUs.

Recommended hardware:

Quick start:

ollama run mistral-small:24b

Check compatibility: Mistral Small 24B on RTX 4090

Mistral Small 24B is the model to choose when you need reliable instruction following, structured output, or function calling — tasks where smaller models often struggle with format consistency.

Devstral Small 24B — Built for Agentic Coding

Devstral Small 24B is the newest member of the Mistral coding family. Where Codestral focuses on code completion and generation, Devstral is designed for agentic workflows: multi-step coding tasks, tool use, and autonomous software engineering agents.

It shares Mistral Small 24B's parameter count and therefore identical VRAM requirements: 13.4 GB at Q4_K_M.

Recommended hardware:

Quick start:

ollama run devstral:24b

Devstral shines when you're running frameworks like LangChain, AutoGen, or open-source coding agents like SWE-agent. Its function calling and tool use capabilities are tuned for multi-step reasoning over codebases.

Choosing the Right Quantization for Mistral Models

Unlike reasoning models such as DeepSeek R1 — which are very sensitive to quantization precision — Mistral's chat and code models are fairly robust. They tolerate Q4 quantization well for most tasks.

General guidance:

  • Q6_K — best quality, use when VRAM allows
  • Q5_K_M — good balance, minimal quality loss over Q6
  • Q4_K_M — the standard choice, excellent for most use cases
  • Q3_K_M — noticeable degradation, only when you have no choice

For Codestral and Devstral specifically, stay at Q4 or above. Code generation is more sensitive to precision than general conversation — subtle errors in logic or syntax don't show up until the code fails.

For more details on quantization trade-offs, read our quantization guide.

Mistral vs the Competition

How does Mistral's lineup compare to alternatives at similar sizes?

ModelParamsVRAM (Q4)StrengthsBest For
Mistral 7B7.2B4 GBEfficiency, speedFast general tasks
Llama 3.1 8B8.0B4.5 GBBroad capabilityGeneral assistant
Mistral Nemo 12B12.2B6.8 GBLong context, multilingualDocument processing
Codestral 22B22.2B12.3 GBFIM, 80+ languagesCode completion
Mistral Small 24B24.0B13.4 GBInstruction followingAssistants, function calling
Qwen3 30B30.0B16.8 GBReasoning + chatVersatile tasks

Mistral 7B remains competitive with Llama 3.1 8B despite being older — a testament to its architecture efficiency. At the 24B tier, Mistral Small holds its own against much heavier models.

Performance Expectations

Decode speed is primarily determined by your GPU's memory bandwidth. Here are approximate throughput numbers at Q4_K_M:

HardwareMistral 7BMistral Nemo 12BCodestral 22BMistral Small 24B
RTX 4060 8GB~50 tok/s~30 tok/s
RTX 4070 12GB~65 tok/s~38 tok/s
RTX 4070 Ti Super 16GB~70 tok/s~42 tok/s~28 tok/s~25 tok/s
RTX 4090 24GB~105 tok/s~62 tok/s~42 tok/s~38 tok/s
Mac M4 Pro 24GB~42 tok/s~25 tok/s~17 tok/s~15 tok/s
Mac M4 Max 64GB~50 tok/s~30 tok/s~20 tok/s~18 tok/s

Approximate values with Q4_K_M quantization. Actual performance varies by runtime, context length, and system configuration.

Apple Silicon delivers lower peak token throughput than high-end NVIDIA GPUs, but unified memory is a major advantage: you can run larger models at higher quality without the hard VRAM ceiling that limits discrete GPUs.

Setting Up Mistral Models Locally

The easiest path to running any Mistral model is Ollama.

1. Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh

2. Pull and run your chosen model:

# General assistant
ollama run mistral-small:24b

# Coding tasks
ollama run codestral:22b

# Agentic coding
ollama run devstral:24b

# Mid-range with long context
ollama run mistral-nemo:12b

# Lightweight / fastest
ollama run mistral:7b

3. For editor integration (VS Code, Neovim):

If you want Codestral's fill-in-the-middle completion in your editor, pair it with Continue or tabbynine and point them at your local Ollama endpoint (http://localhost:11434).

4. Check your hardware first:

Before downloading multi-gigabyte models, use our VRAM calculator to confirm your GPU can handle your target model at your preferred quantization level.

Mistral Large — What About the Frontier Model?

Mistral Large is Mistral's closed commercial model. It's not available as open weights — you access it through the Mistral API or through partner providers. If you need Mistral Large capability for local inference, the closest alternative is Mistral Small 24B at Q6+, which captures a significant portion of Large's instruction-following advantage at a fraction of the hardware cost.

For local frontier-class models, consider DeepSeek R1 32B or Qwen3 30B — both are open-weight, fit on high-end consumer GPUs, and compete with commercial frontier models on reasoning tasks.

Next Steps

Frequently Asked Questions

How much VRAM does Mistral 7B need?

Mistral 7B (7.2B parameters) needs approximately 4GB at Q4_K_M, 5GB at Q5_K_M, or 7.6GB at Q8. It's one of the most efficient 7B-class models and runs well on GPUs with 8GB+ VRAM.

Can I run Mistral Small 24B locally?

Yes. Mistral Small 24B needs ~13.4GB at Q4_K_M. An RTX 4070 Ti Super (16GB) handles it at Q4, and an RTX 4090 (24GB) runs it at Q6+ for better quality.

What is Codestral and how much VRAM does it need?

Codestral is Mistral's coding-specialized model at 22B parameters. It needs ~12.3GB at Q4_K_M. It's optimized for code generation, completion, and explanation, competing with models like Qwen 3 Coder.

Is Mistral better than Llama?

Mistral and Llama have different strengths. Mistral 7B is highly efficient at its size. Mistral Small 24B excels at instruction following. For coding, Codestral/Devstral are competitive. The best choice depends on your use case and hardware.

Which Mistral model is best for coding?

Devstral Small (24B) and Codestral (22B) are Mistral's coding-focused models. Devstral is designed specifically for agentic coding workflows. Both need ~12-13GB at Q4, making them good fits for RTX 4070 Ti Super or RTX 4090.