Best AI Models for 24GB VRAM — RTX 4090 & RTX 5090 (LLMs, Image & Video)
RTX 4090 (24GB) runs 30B models, Flux at FP8, and video generation. RTX 5090 (32GB) adds room for Q5-Q6 quants on 30B models and longer video clips. Complete guide with VRAM tables and speed estimates.
24GB and 32GB are the power tiers for local AI. The RTX 4090 (24GB) has been the gold standard for enthusiasts since its 2022 launch, and the RTX 5090 (32GB) extends that lead. Both GPUs run 30B-class LLMs, generate images with Flux, and handle video generation, all locally.
This guide ranks the best models for each GPU based on our 10-factor compatibility scoring.
Best LLMs for 24-32GB VRAM
Tier S: The Sweet Spot — MoE Models
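MoE (Mixture of Experts) models are where 24-32GB GPUs truly shine. They load all expert weights into VRAM but only activate a fraction of them per token, so memory use follows the total parameter count while per-token compute follows the much smaller active count. The result is large-model quality at small-model speeds.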
| Model | Total Params | Active | VRAM (Q4) | Speed (4090) | Why |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B | 25.2B | 3.8B | ~15 GB | ~70 tok/s | 89% AIME, Apache 2.0 |
| Qwen 3 30B-A3B | 30B | 3B | ~17 GB | ~70 tok/s | Best coding MoE |
| Qwen 3.5 35B-A3B | 35B | 3B | ~20 GB | ~65 tok/s | Updated Qwen MoE (5090 sweet spot) |
Recommended pick: Gemma 4 26B-A4B. It fits on both 24GB and 32GB at Q4, is multimodal and Apache 2.0 licensed, has the smallest footprint of the three, and ties Qwen 3 30B-A3B for the fastest inference (~70 tok/s).
```bash
ollama run gemma4:26b
```
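The VRAM (Q4) column in the MoE table can be roughly sanity-checked from the total parameter counts. The following is a minimal sketch, not the scoring method used in this guide: `weights_gb` is a hypothetical helper, and the 4.5 bits per weight figure is an assumed average for a Q4-class quant. It ignores the KV cache and runtime overhead.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    MoE models keep ALL experts resident in VRAM, so the total
    parameter count (not the active count) drives the footprint.
    """
    return total_params_b * bits_per_weight / 8


# Total parameter counts (in billions) taken from the MoE table.
MOE_MODELS = {
    "Gemma 4 26B-A4B": 25.2,
    "Qwen 3 30B-A3B": 30.0,
    "Qwen 3.5 35B-A3B": 35.0,
}

ASSUMED_Q4_BITS = 4.5  # assumption: average bits per weight for a Q4-class quant

for name, params_b in MOE_MODELS.items():
    print(f"{name}: ~{weights_gb(params_b, ASSUMED_Q4_BITS):.1f} GB of weights")
```

Under that assumption the estimates land slightly below the table's ~15, ~17, and ~20 GB figures; the difference would come from overhead and quant-format details that this sketch does not model.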
Tier A: Dense 30B Models
Dense models use every parameter for every token. Slower than MoE at the same total size, but some tasks benefit from the uniform compute.
| Model | Params | VRAM (Q4) | VRAM (Q6) | Best For |
|---|---|---|---|---|
| Gemma 4 31B | 30.7B | ~18 GB | ~26 GB | Highest quality dense, 256K context |
| Qwen 3 32B | 32B | ~19 GB | ~28 GB | Strong all-rounder |
| DeepSeek R1 Distill 32B | 32B | ~18 GB | ~26 GB | Best reasoning at this size |
| Mistral Small 3 24B | 24B | ~14 GB | ~20 GB | Fast, good coding |
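24GB GPU: These fit at Q4 with limited context headroom. Mistral Small 3 24B is the exception: at ~20 GB it also fits at Q6.

32GB GPU: These run at Q5-Q6, a noticeable quality improvement. Q5 leaves room for long context; Q6 on the larger models (~26-28 GB) leaves only 4-6 GB.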
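"Context headroom" is mostly KV cache, which grows linearly with the number of tokens. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head). The layer count, KV head count, and head dimension are illustrative assumptions, not the published configuration of any model in the table, and both helper names are hypothetical.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for an FP16 cache by default."""
    # Factor 2 = one key tensor + one value tensor per layer.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token_bytes / 1e9


def max_context_tokens(vram_gb: float, weights_gb: float, layers: int,
                       kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Tokens of KV cache that fit in the VRAM left after the weights."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    free_bytes = (vram_gb - weights_gb) * 1e9
    return max(0, int(free_bytes // per_token_bytes))


# Illustrative (assumed) architecture for a dense ~32B model.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

print(f"8K context:  {kv_cache_gb(8192, LAYERS, KV_HEADS, HEAD_DIM):.2f} GB")
print(f"32K context: {kv_cache_gb(32768, LAYERS, KV_HEADS, HEAD_DIM):.2f} GB")
# ~19 GB of Q4 weights on a 24 GB card, ignoring other runtime overhead.
print(max_context_tokens(24, 19, LAYERS, KV_HEADS, HEAD_DIM), "tokens fit")
```

Real runtimes also reserve memory for activations and buffers, so the usable context will be lower than this estimate.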
Tier B: 14B at Maximum Quality
With 24GB, you can run 14B models at Q8, which is close to lossless. Full FP16 (~28 GB for a 14B model) needs the 32GB card, which can also hold Gemma 3 27B at Q8.
| Model | Params | VRAM (Q8) | VRAM (FP16) | Notes |
|---|---|---|---|---|
| Qwen 3 14B | 14B | ~16 GB | ~28 GB | Q8 on 24GB, FP16 on 32GB |
| Gemma 3 27B | 27B | ~29 GB | ~54 GB | Q8 only on 32GB |
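The notes column follows a simple rule: pick the highest-precision variant whose footprint still fits. A minimal sketch of that rule, using the VRAM figures from the table; `best_quant` and the 2 GB reserve for context and runtime overhead are assumptions made for illustration, not part of this guide's scoring.

```python
from typing import Optional

# VRAM figures in GB copied from the table, highest precision first.
QUANT_VRAM_GB = {
    "Qwen 3 14B": [("FP16", 28), ("Q8", 16)],
    "Gemma 3 27B": [("FP16", 54), ("Q8", 29)],
}


def best_quant(model: str, vram_gb: float, reserve_gb: float = 2.0) -> Optional[str]:
    """Return the highest-precision variant that fits, or None if none do.

    reserve_gb is an assumed allowance for KV cache and runtime buffers.
    """
    for quant, needed_gb in QUANT_VRAM_GB[model]:
        if needed_gb + reserve_gb <= vram_gb:
            return quant
    return None


for vram in (24, 32):
    for model in QUANT_VRAM_GB:
        print(f"{vram} GB, {model}: {best_quant(model, vram)}")
```

With the default reserve this reproduces the table's notes: Qwen 3 14B gets Q8 on 24GB and FP16 on 32GB, while Gemma 3 27B only fits at Q8 on 32GB.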
Coding-Specific
| Model | VRAM (Q4) | Notes |
|---|---|---|
| Qwen 3 Coder 30B-A3B | ~17 GB | MoE, best coding efficiency |
| Qwen 3.5 27B | ~16 GB | Strong dense coding model |
| DeepSeek Coder V2 236B | — | Too large for a single GPU |
Best Image Models for 24-32GB
| Model | FP16 | FP8 | Best On |
|---|---|---|---|
| Flux.1 Dev | 33 GB | 13 GB | Both: FP8 on 24GB; FP16 on 32GB only with offloading |
| Flux.1 Schnell | 33 GB | 13 GB | Both — fastest Flux |
| Flux.2 Dev | ~35 GB | ~14 GB | Both at FP8 |
| SDXL 1.0 | 8 GB | — | Either — tons of headroom |
| SD 3.5 Large | ~10 GB | — | Either |
24GB: Flux runs at FP8 (~13 GB), leaving roughly 11 GB for ControlNet and LoRAs. SDXL runs at FP16 with massive headroom for batch generation.
32GB: Flux at FP16 (~33 GB) slightly exceeds the card's memory, so it only runs with part of the pipeline offloaded. Better to use FP8 and allocate the extra VRAM to larger batch sizes or higher resolutions.
Best Video Models for 24-32GB
| Model | FP16 | FP8 | Fits 24GB? | Fits 32GB? |
|---|---|---|---|---|
| LTX Video 2B | 14-22 GB | ~8-13 GB | Yes (FP16 at 768p) | Yes (FP16 at 720p) |
| AnimateDiff | 6-24 GB | — | Yes (short clips) | Yes |
| Wan Video 1.3B | 9-13 GB | ~6 GB | Yes | Yes |
| CogVideoX 5B | 25-36 GB | ~16 GB | FP8 only | FP8 comfortably |
| HunyuanVideo | 47-58 GB | ~8 GB* | FP8+tiling | FP8+tiling |
*HunyuanVideo at FP8 with temporal tiling can fit on 24GB but generation is slow. The 32GB GPU gives more room for longer clips.
RTX 4090 vs RTX 5090 — Which to Buy?
| Spec | RTX 4090 | RTX 5090 |
|---|---|---|
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 |
| Bandwidth | 1,008 GB/s | ~1,792 GB/s |
| Price (launch MSRP) | ~$1,600 | ~$2,000 |
| Best 30B quant | Q4_K_M | Q5_K_M-Q6_K |
| Flux precision | FP8 | FP8 (headroom) |
Buy the 4090 if: Budget is a concern and Q4 quality is acceptable. It handles the large majority of the local AI workloads covered in this guide.
Buy the 5090 if: You want Q5-Q6 on 30B models (a noticeable quality bump), run larger video generation jobs, or frequently push VRAM limits.
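The bandwidth row matters because single-stream token generation is typically memory-bandwidth-bound: each new token requires reading the active weights once. A rough ceiling is therefore bandwidth divided by the bytes read per token. The sketch below applies that rule of thumb to the table's bandwidth figures and the ~19 GB Q4 footprint of Qwen 3 32B; `decode_ceiling_tok_s` is a hypothetical helper, and measured speeds will be lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on tokens/second if every token reads all active weights once."""
    return bandwidth_gb_s / active_weights_gb


# Bandwidth figures (GB/s) from the spec table.
BANDWIDTH_GB_S = {"RTX 4090": 1008, "RTX 5090": 1792}

DENSE_Q4_GB = 19  # Qwen 3 32B at Q4, from the dense model table

for gpu, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, DENSE_Q4_GB)
    print(f"{gpu}: at most ~{ceiling:.0f} tok/s on a {DENSE_Q4_GB} GB dense model")

ratio = BANDWIDTH_GB_S["RTX 5090"] / BANDWIDTH_GB_S["RTX 4090"]
print(f"Bandwidth ratio: {ratio:.2f}x")
```

For MoE models the bytes read per token scale with the active parameters rather than the total, which is why they decode faster than dense models of the same size.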
Quick Start
```bash
# Best overall (MoE, frontier quality)
ollama run gemma4:26b

# Best for coding
ollama run qwen3-coder:30b-a3b

# Best for reasoning
ollama run deepseek-r1:32b
```
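Once a model has been pulled, it can also be called from a script through Ollama's local REST API. A minimal sketch using only the Python standard library, assuming the server is running on its default port (11434) and that the model tag from the commands above is available on your machine; `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example (requires a running Ollama server with the model already pulled):
# print(generate("deepseek-r1:32b", "Summarize what a KV cache is in two sentences."))
```

Setting `"stream": False` returns the whole reply as a single JSON object, which keeps the sketch short; interactive tools usually stream tokens instead.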