MiniMax M2.7 VRAM Requirements - 230B MoE Agentic Model Hardware Guide
Exact VRAM for MiniMax M2.7 (230B total, 10B active MoE) at Dynamic 4-bit, Q2, Q3, Q4, Q5, Q8, and BF16. Includes Mac, multi-GPU, and partial offload scenarios.
If you are searching for MiniMax M2.7 VRAM requirements, this is the focused answer.
Quick answers
- Dynamic 4-bit (UD-IQ4_XS): ~108 GB
- Q2_K: ~83 GB
- Q4_K_S: ~131 GB
- Q4_K_M: ~140 GB
- Q8_0: ~243 GB
- BF16: ~457 GB
MiniMax M2.7 is a 230B-parameter Sparse Mixture of Experts model that activates only 10B parameters per token. It uses 256 experts with 8 active per forward pass across 62 layers, and supports a 204,800-token context window. Released on April 12, 2026 under Apache 2.0, it is fully open-source with GGUF quantizations available from Unsloth.
M2.7 specializes in agentic and coding tasks. MiniMax describes it as a "self-evolving agent" and the benchmarks reflect that focus: 56.22% on SWE-Pro, 57.0% on Terminal Bench 2, and 55.6% on VIBE-Pro.
The catch: 230B total parameters means every expert must live in memory, even though only 10B fire per token. This makes VRAM planning critical.
MiniMax M2.7 VRAM by Quantization
These numbers are model weight sizes from Unsloth GGUF releases. Add 2-4 GB for KV cache and runtime overhead at moderate context lengths. The 204K context window can consume significantly more if you push it.
| Quantization | VRAM |
|---|---|
| Q2_K | 83 GB |
| Q3_K_S | 94 GB |
| UD-IQ4_XS (Dynamic 4-bit) | 108 GB |
| Q4_K_S | 131 GB |
| Q4_K_M | 140 GB |
| Q5_K_M | 169 GB |
| Q8_0 | 243 GB |
| BF16 | 457 GB |
Key takeaway: Even the most aggressive quantization (Q2_K) needs 83 GB. This is not a model you run on a single consumer GPU. But the MoE architecture means inference speed scales with the 10B active parameters, not the full 230B, so once it fits in memory it runs faster than you might expect.
What Hardware Runs MiniMax M2.7?
Apple Silicon Macs
Macs with large unified memory are the most practical consumer path for M2.7.
- 128 GB Mac (M2 Ultra, M4 Max): Dynamic 4-bit at ~108 GB fits with limited headroom. Expect around 15 tok/s. Context will be constrained.
- 192 GB Mac (M2 Ultra, M4 Max): Q4_K_M at ~140 GB is comfortable. Good balance of quality and speed with room for longer context.
- 256 GB Mac (M4 Ultra): Q8_0 at ~243 GB fits with headroom. This is the premium local experience.
Apple Silicon's unified memory architecture avoids the CPU-GPU transfer bottleneck that kills performance with partial offloading on discrete GPUs. For a 230B MoE model, this matters a lot.
Multi-GPU Setups
If you have access to data center or workstation GPUs, M2.7 becomes straightforward.
- 4x A100 80GB (320 GB total): Q8_0 at ~243 GB fits across the cluster with room for KV cache
- 2x A100 80GB (160 GB total): Q4_K_M at ~140 GB is realistic
- 4x RTX 3090/4090 24GB (96 GB total): Q2_K at ~83 GB is technically possible but leaves almost no headroom
For multi-GPU inference, use vLLM, TensorRT-LLM, or llama.cpp with tensor parallelism. The MoE routing adds some inter-GPU communication overhead, but the 10B active parameter count keeps compute requirements manageable.
Partial Offload (GPU + System RAM)
If you have a single GPU and plenty of system RAM, partial offloading through llama.cpp is an option.
- 1x 16GB GPU + 96GB RAM: Dynamic 4-bit is feasible. Offload most layers to RAM, keep a few on GPU. Expect around 2-5 tok/s depending on RAM bandwidth.
- 1x 24GB GPU + 128GB RAM: Same approach with slightly more GPU layers. Marginally faster but still RAM-bandwidth-bound.
Partial offload works, but the experience is significantly slower than full GPU or unified memory inference. The constant shuffling of expert weights between RAM and VRAM creates a bottleneck that MoE architectures are particularly sensitive to.
Single Consumer GPU Limitations
A single RTX 4090 (24 GB), RTX 5090 (32 GB), or any other consumer GPU cannot run MiniMax M2.7 from VRAM alone. The smallest quantization needs 83 GB. Without system RAM offloading, this model is out of reach for single-GPU desktop setups.
How Does MiniMax M2.7 Compare?
M2.7 sits in an interesting spot: smaller than DeepSeek V3.2 but with a similar MoE philosophy, and much larger than Qwen 3.5 122B-A10B while activating the same number of parameters.
| Model | Total Params | Active Params | Experts | VRAM (Q4-class) | VRAM (Q8) |
|---|---|---|---|---|---|
| MiniMax M2.7 | 230B | 10B | 256 (8 active) | ~131 GB (Q4_K_S) | ~243 GB |
| Qwen 3.5 122B-A10B | 122B | 10B | MoE | ~74 GB (Q4_K_M) | ~131 GB |
| DeepSeek V3.2 | 671B | 37B | 256 (8 active) | ~350 GB (Q4) | ~640 GB |
Against Qwen 3.5 122B-A10B: Both activate 10B parameters, but M2.7 has nearly twice the total parameter count (230B vs 122B). M2.7 needs roughly 1.8x the VRAM at comparable quantization. The extra capacity shows up in agentic and coding benchmarks where M2.7 pulls ahead. If you have the memory budget, M2.7 is the stronger model. If not, Qwen 3.5 122B-A10B is far easier to fit.
Against DeepSeek V3.2: DeepSeek V3.2 is nearly 3x larger in total parameters and activates 3.7x more per token. It needs roughly 2.5x the VRAM. M2.7 is substantially more accessible for local inference while still competing on agentic benchmarks.
Best Quantization for MiniMax M2.7
- Q2_K if you are memory-constrained and want the model to fit at all (83 GB minimum)
- UD-IQ4_XS (Dynamic 4-bit) for the best balance of quality and memory on 128 GB Macs or partial offload setups
- Q4_K_M if you have 192 GB+ and want solid everyday quality
- Q8_0 if you have 256 GB+ (Mac) or multi-GPU and want near-lossless inference
- BF16 only for research or evaluation with 512 GB+ available
For most local users targeting this model, Dynamic 4-bit or Q4_K_M will be the practical choice.
Bottom Line
MiniMax M2.7 is a serious agentic model that brings 230B-scale MoE to open source. The 10B active parameters give it surprisingly fast inference once it fits in memory, and the Apache 2.0 license removes deployment restrictions.
The hardware reality:
- On a 128 GB Mac, it runs at Dynamic 4-bit -- tight but functional
- On a 192-256 GB Mac, it becomes a strong local agentic coding assistant
- On multi-GPU setups (2-4x A100), it runs comfortably at Q4 to Q8
- On a single consumer GPU, it does not fit without system RAM offloading
If you want to check exact fit for your hardware, use the VRAM calculator or compare M2.7 against other models.