Mistral VRAM Requirements (2026) — Will Your GPU Run 7B, Nemo 12B, Codestral 22B or Small 24B?
Mistral 7B needs ~4.3GB at Q4. Nemo 12B ~7.1GB. Codestral 22B ~12.8GB. Small 24B ~13.4GB. Full Q4/Q5/Q6/Q8 VRAM tables with 8GB/12GB/24GB GPU picks and Mac fit.
If you are searching for Mistral 7B VRAM requirements or Mistral Small 24B VRAM requirements, this guide gives exact numbers by quantization level, plus the GPU and Mac hardware that fits.
If you actually mean the newer 119B model, use the dedicated landing page here: Mistral Small 4 VRAM Requirements.
Mistral AI has built one of the most efficient open-weight model families available. From the scrappy 7B that punches well above its class to purpose-built coding specialists like Codestral and Devstral, there's a Mistral model for almost any hardware setup. But knowing which one fits your GPU - and at what quantization level - takes a bit of research.
This guide covers the full Mistral model lineup with exact VRAM requirements, hardware recommendations, and quick-start commands.
Fast reference
- Mistral 7B: ~4.0 GB at Q4_K_M
- Mistral Nemo 12B: ~6.8 GB at Q4_K_M
- Codestral 22B: ~12.3 GB at Q4_K_M
- Mistral Small 24B: ~13.4 GB at Q4_K_M
- Devstral Small 24B: ~13.4 GB at Q4_K_M
The Mistral Model Family
Mistral AI has taken a focused approach: fewer, better models rather than a sprawling lineup. Their open-weight releases cluster around a few key sizes:
- Mistral 7B — the original, still one of the best 7B models available
- Mistral Nemo 12B — a mid-range option co-developed with NVIDIA
- Mistral Small 24B — instruction following and general assistant tasks
- Codestral 22B — code generation, completion, and explanation
- Devstral Small 24B — agentic coding workflows
- Mistral Large — the closed/commercial frontier model (discussed below)
The most important thing to understand about Mistral models is their architecture efficiency. Mistral 7B uses sliding window attention and grouped query attention, which means it delivers better throughput and handles longer contexts more efficiently than comparable models at the same parameter count.
VRAM Requirements by Model
Here's what each model needs at different quantization levels:
| Model | Params | Q4_K_M | Q5_K_M | Q6_K | Q8_0 | F16 |
|---|---|---|---|---|---|---|
| Mistral 7B | 7.2B | 4.0 GB | 5.0 GB | 5.9 GB | 7.6 GB | 14.4 GB |
| Mistral Nemo 12B | 12.2B | 6.8 GB | 8.4 GB | 9.9 GB | 12.9 GB | 24.4 GB |
| Codestral 22B | 22.2B | 12.3 GB | 15.2 GB | 17.8 GB | 23.3 GB | 44.4 GB |
| Mistral Small 24B | 24.0B | 13.4 GB | 16.5 GB | 19.4 GB | 25.4 GB | 48.0 GB |
| Devstral Small 24B | 24.0B | 13.4 GB | 16.5 GB | 19.4 GB | 25.4 GB | 48.0 GB |
Note: These are model weight sizes. Add ~1-2 GB for KV cache and runtime overhead at default context lengths. Codestral supports a 32K context window, which increases KV cache requirements significantly at longer contexts.
Mistral Large is not included — it's a closed API model that Mistral doesn't release as open weights for local inference. If you need Mistral Large-class quality locally, Mistral Small 24B at Q6+ gets surprisingly close.
Hardware Recommendations by Model
Mistral 7B — The Efficient Workhorse
Mistral 7B was a watershed release. When it launched in late 2023 it outperformed Llama 2 13B on most benchmarks at half the parameter count. That efficiency advantage still holds today.
At Q4_K_M it needs just 4 GB of VRAM, making it one of the only models that fits comfortably in dedicated GPU memory on integrated graphics and entry-level cards.
Recommended hardware:
- RTX 4060 8GB — fits at Q8 with room to spare
- RTX 4070 12GB — comfortable at Q8, fast throughput
- Any Mac with 8GB+ unified memory
Quick start:
ollama run mistral:7b
Check compatibility: Mistral 7B on RTX 4060 | Mistral 7B on RTX 4070
When to choose Mistral 7B: you have a budget GPU or you need maximum tokens per second for high-volume tasks. It's a workhorse, not a showpiece.
Mistral Nemo 12B — The Underrated Mid-Ranger
Mistral Nemo 12B was a collaboration between Mistral and NVIDIA, and it shows. The model uses a 128K context window — far beyond what most models its size offer — and delivers strong multilingual performance.
At Q4_K_M it needs 6.8 GB, which fits on any 8GB GPU with room for context. At Q6_K it needs ~10 GB, making the 12GB RTX 4070 a natural match.
Recommended hardware:
- RTX 4060 Ti 8GB — fits at Q4 with tight margins
- RTX 4070 12GB — fits at Q6, good quality
- RTX 4070 Ti Super 16GB — Q8 with headroom for long contexts
- Mac M3 Pro 18GB — Q6 comfortably
Quick start:
ollama run mistral-nemo:12b
The 128K context window is a genuine differentiator. If you're processing long documents, chat histories, or large codebases, Nemo handles it at a hardware cost you can afford.
Codestral 22B — The Code Specialist
Codestral 22B is Mistral's dedicated code model. Trained specifically on code data, it supports 80+ programming languages and is optimized for fill-in-the-middle (FIM) completion — the technique that powers inline code suggestions in editors like VS Code and Neovim.
At Q4_K_M it needs 12.3 GB. That's a tight fit for 12GB cards but a comfortable one for the 16GB RTX 4070 Ti Super.
Recommended hardware:
- RTX 4070 Ti Super 16GB — fits at Q4, best mainstream choice
- RTX 4080 Super 16GB — Q4 with excellent speed
- RTX 4090 24GB — Q6+ for best code quality
- Mac M3 Pro 18GB — Q4 fits comfortably
- Mac M4 Pro 24GB — Q6 with room for long context
Quick start:
ollama run codestral:22b
Check compatibility: Codestral 22B on RTX 4070 Ti Super | Codestral 22B on RTX 4090
Codestral's FIM capability means it works with Continue, Cursor, and similar coding assistants that support local models. If you're setting up a local AI coding workflow, this is the model to start with.
Mistral Small 24B — Best-in-Class Instruction Following
Mistral Small 24B is Mistral's flagship open-weight general assistant. It's optimized for instruction following, function calling, and multi-turn conversation. On several instruction-following benchmarks it outperforms models significantly larger than it.
At Q4_K_M it needs 13.4 GB, which puts it just over the limit for 12GB cards but squarely within reach for 16GB GPUs.
Recommended hardware:
- RTX 4070 Ti Super 16GB — fits at Q4, solid performance
- RTX 4080 Super 16GB — Q4 with faster decode
- RTX 4090 24GB — Q6+ for noticeably better quality
- RTX 5090 32GB — Q6+ comfortably, future-proof
- Mac M4 Pro 24GB — Q5 fits with some headroom
- Mac M4 Max 36GB — Q6+ easily
Quick start:
ollama run mistral-small:24b
Check compatibility: Mistral Small 24B on RTX 4090
Mistral Small 24B is the model to choose when you need reliable instruction following, structured output, or function calling — tasks where smaller models often struggle with format consistency.
Devstral Small 24B — Built for Agentic Coding
Devstral Small 24B is the newest member of the Mistral coding family. Where Codestral focuses on code completion and generation, Devstral is designed for agentic workflows: multi-step coding tasks, tool use, and autonomous software engineering agents.
It shares Mistral Small 24B's parameter count and therefore identical VRAM requirements: 13.4 GB at Q4_K_M.
Recommended hardware:
- RTX 4070 Ti Super 16GB — fits at Q4, practical choice
- RTX 4090 24GB — Q5-Q6 for best agentic performance
- Mac M4 Pro 24GB — Q5 comfortably
- Mac M4 Max 36GB — Q6+ without compromise
Quick start:
ollama run devstral:24b
Devstral shines when you're running frameworks like LangChain, AutoGen, or open-source coding agents like SWE-agent. Its function calling and tool use capabilities are tuned for multi-step reasoning over codebases.
Choosing the Right Quantization for Mistral Models
Unlike reasoning models such as DeepSeek R1 — which are very sensitive to quantization precision — Mistral's chat and code models are fairly robust. They tolerate Q4 quantization well for most tasks.
General guidance:
- Q6_K — best quality, use when VRAM allows
- Q5_K_M — good balance, minimal quality loss over Q6
- Q4_K_M — the standard choice, excellent for most use cases
- Q3_K_M — noticeable degradation, only when you have no choice
For Codestral and Devstral specifically, stay at Q4 or above. Code generation is more sensitive to precision than general conversation — subtle errors in logic or syntax don't show up until the code fails.
For more details on quantization trade-offs, read our quantization guide.
Mistral vs the Competition
How does Mistral's lineup compare to alternatives at similar sizes?
| Model | Params | VRAM (Q4) | Strengths | Best For |
|---|---|---|---|---|
| Mistral 7B | 7.2B | 4 GB | Efficiency, speed | Fast general tasks |
| Llama 3.1 8B | 8.0B | 4.5 GB | Broad capability | General assistant |
| Mistral Nemo 12B | 12.2B | 6.8 GB | Long context, multilingual | Document processing |
| Codestral 22B | 22.2B | 12.3 GB | FIM, 80+ languages | Code completion |
| Mistral Small 24B | 24.0B | 13.4 GB | Instruction following | Assistants, function calling |
| Qwen3 30B | 30.0B | 16.8 GB | Reasoning + chat | Versatile tasks |
Mistral 7B remains competitive with Llama 3.1 8B despite being older — a testament to its architecture efficiency. At the 24B tier, Mistral Small holds its own against much heavier models.
Performance Expectations
Decode speed is primarily determined by your GPU's memory bandwidth. Here are approximate throughput numbers at Q4_K_M:
| Hardware | Mistral 7B | Mistral Nemo 12B | Codestral 22B | Mistral Small 24B |
|---|---|---|---|---|
| RTX 4060 8GB | ~50 tok/s | ~30 tok/s | — | — |
| RTX 4070 12GB | ~65 tok/s | ~38 tok/s | — | — |
| RTX 4070 Ti Super 16GB | ~70 tok/s | ~42 tok/s | ~28 tok/s | ~25 tok/s |
| RTX 4090 24GB | ~105 tok/s | ~62 tok/s | ~42 tok/s | ~38 tok/s |
| Mac M4 Pro 24GB | ~42 tok/s | ~25 tok/s | ~17 tok/s | ~15 tok/s |
| Mac M4 Max 64GB | ~50 tok/s | ~30 tok/s | ~20 tok/s | ~18 tok/s |
Approximate values with Q4_K_M quantization. Actual performance varies by runtime, context length, and system configuration.
Apple Silicon delivers lower peak token throughput than high-end NVIDIA GPUs, but unified memory is a major advantage: you can run larger models at higher quality without the hard VRAM ceiling that limits discrete GPUs.
Setting Up Mistral Models Locally
The easiest path to running any Mistral model is Ollama.
1. Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
2. Pull and run your chosen model:
# General assistant
ollama run mistral-small:24b
# Coding tasks
ollama run codestral:22b
# Agentic coding
ollama run devstral:24b
# Mid-range with long context
ollama run mistral-nemo:12b
# Lightweight / fastest
ollama run mistral:7b
3. For editor integration (VS Code, Neovim):
If you want Codestral's fill-in-the-middle completion in your editor, pair it with Continue or tabbynine and point them at your local Ollama endpoint (http://localhost:11434).
4. Check your hardware first:
Before downloading multi-gigabyte models, use our VRAM calculator to confirm your GPU can handle your target model at your preferred quantization level.
Mistral Large — What About the Frontier Model?
Mistral Large is Mistral's closed commercial model. It's not available as open weights — you access it through the Mistral API or through partner providers. If you need Mistral Large capability for local inference, the closest alternative is Mistral Small 24B at Q6+, which captures a significant portion of Large's instruction-following advantage at a fraction of the hardware cost.
For local frontier-class models, consider DeepSeek R1 32B or Qwen3 30B — both are open-weight, fit on high-end consumer GPUs, and compete with commercial frontier models on reasoning tasks.
Next Steps
- Check your hardware compatibility — see which Mistral model fits your GPU
- Compare Mistral Small vs Codestral vs Devstral
- Browse all Mistral models
- Read our VRAM requirements reference
- Explore GPU recommendations for local AI
- DeepSeek R1 GPU requirements — if you're also evaluating reasoning models