Will It Run AI

Best AI Models for 24GB VRAM — RTX 4090 & RTX 5090 (LLMs, Image & Video)

RTX 4090 (24GB) runs 30B models, Flux at FP8, and video generation. RTX 5090 (32GB) adds 70B MoE models. Complete guide with VRAM tables and speed estimates.

24GB and 32GB are the power tiers for local AI. The RTX 4090 (24GB) has been the gold standard for enthusiasts since 2022, and the RTX 5090 (32GB) extends that lead. These GPUs run frontier-class models, generate images with Flux, and handle video generation — all locally.

This guide ranks the best models for each GPU based on our 10-factor compatibility scoring.

Best LLMs for 24-32GB VRAM

Tier S: The Sweet Spot — MoE Models

MoE (Mixture of Experts) models are where 24-32GB GPUs truly shine. They load all expert weights into VRAM but only activate a fraction per token, giving frontier quality at fast speeds.

| Model | Total Params | Active | VRAM (Q4) | Speed (4090) | Why |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B | 25.2B | 3.8B | ~15 GB | ~70 tok/s | 89% AIME, Apache 2.0 |
| Qwen 3 30B-A3B | 30B | 3B | ~17 GB | ~70 tok/s | Best coding MoE |
| Qwen 3.5 35B-A3B | 35B | 3B | ~20 GB | ~65 tok/s | Updated Qwen MoE (5090 sweet spot) |
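
The VRAM column tracks total parameters rather than active parameters, because every expert has to be resident in memory. The arithmetic can be sketched roughly as follows. The 4.6 bits per weight is an assumed round figure for a Q4-class quant, not a number from this guide, and the estimate covers weights only (no KV cache or runtime overhead).

```python
def weight_vram_gb(total_params_billion: float, bits_per_weight: float = 4.6) -> float:
    """Approximate VRAM for the weights alone, in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.6 is an assumed effective size for a Q4-class quant.
    """
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Total parameter counts (in billions) taken from the table above.
moe_models = {
    "Gemma 4 26B-A4B": 25.2,
    "Qwen 3 30B-A3B": 30.0,
    "Qwen 3.5 35B-A3B": 35.0,
}

for name, total_b in moe_models.items():
    print(f"{name}: ~{weight_vram_gb(total_b):.1f} GB of weights at Q4")
```

Under that assumption the estimates land within about 1 GB of the table's Q4 figures. Active parameters do not enter the calculation at all; they affect speed, not footprint.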

Recommended pick: Gemma 4 26B-A4B. It fits on both 24GB and 32GB at Q4, is multimodal, is Apache 2.0 licensed, and ties Qwen 3 30B-A3B for the fastest inference of the three (~70 tok/s on the 4090).

ollama run gemma4:26b

Tier A: Dense 30B Models

Dense models use every parameter for every token. Slower than MoE at the same total size, but some tasks benefit from the uniform compute.

| Model | Params | VRAM (Q4) | VRAM (Q6) | Best For |
|---|---|---|---|---|
| Gemma 4 31B | 30.7B | ~18 GB | ~26 GB | Highest quality dense, 256K context |
| Qwen 3 32B | 32B | ~19 GB | ~28 GB | Strong all-rounder |
| DeepSeek R1 Distill 32B | 32B | ~18 GB | ~26 GB | Best reasoning at this size |
| Mistral Small 3 24B | 24B | ~14 GB | ~20 GB | Fast, good coding |

24GB GPU: These fit at Q4 with limited context headroom.

32GB GPU: These run at Q5-Q6 with room for long context, a noticeable quality improvement.
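
The fit rule described here can be written as a small lookup. This is only a sketch: the VRAM figures are copied from the dense-model table, while `best_quant` and the 3 GB `headroom_gb` reserve for context and runtime overhead are hypothetical choices made for illustration.

```python
# VRAM figures (GB) copied from the dense-model table above.
DENSE_VRAM_GB = {
    "Gemma 4 31B": {"Q4": 18, "Q6": 26},
    "Qwen 3 32B": {"Q4": 19, "Q6": 28},
    "DeepSeek R1 Distill 32B": {"Q4": 18, "Q6": 26},
    "Mistral Small 3 24B": {"Q4": 14, "Q6": 20},
}


def best_quant(model: str, gpu_vram_gb: float, headroom_gb: float = 3.0):
    """Return the highest-quality listed quant that fits, or None.

    headroom_gb is an assumed reserve for context and overhead.
    """
    budget = gpu_vram_gb - headroom_gb
    # Check Q6 before Q4 so the higher-quality option wins when both fit.
    for quant in ("Q6", "Q4"):
        if DENSE_VRAM_GB[model][quant] <= budget:
            return quant
    return None


for gpu in (24, 32):
    for model in DENSE_VRAM_GB:
        print(f"{gpu} GB, {model}: {best_quant(model, gpu)}")
```

With this reserve, the 30B-class models resolve to Q4 on 24GB and Q6 on 32GB. Raising `headroom_gb` for longer contexts pushes borderline cases such as Qwen 3 32B back down to Q4.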

Tier B: 14B at Maximum Quality

With 24GB, you can run 14B models at Q8, which gives effectively zero quality loss from quantization. FP16 at 14B needs about 28 GB, so it requires the 32GB card.

| Model | Params | VRAM (Q8) | VRAM (FP16) | Notes |
|---|---|---|---|---|
| Qwen 3 14B | 14B | ~16 GB | ~28 GB | Q8 on 24GB, FP16 on 32GB |
| Gemma 3 27B | 27B | ~29 GB | ~54 GB | Q8 only on 32GB |

Coding-Specific

| Model | VRAM (Q4) | Notes |
|---|---|---|
| Qwen 3 Coder 30B-A3B | ~17 GB | MoE, best coding efficiency |
| Qwen 3.5 27B | ~16 GB | Strong coding dense model |
| DeepSeek Coder V2 236B | | Too large for single GPU |

Best Image Models for 24-32GB

| Model | FP16 | FP8 | Best On |
|---|---|---|---|
| Flux.1 Dev | 33 GB | 13 GB | Both: FP8 on 24GB, FP16 tight on 32GB |
| Flux.1 Schnell | 33 GB | 13 GB | Both: fastest Flux |
| Flux.2 Dev | ~35 GB | ~14 GB | Both at FP8 |
| SDXL 1.0 | 8 GB | | Either: tons of headroom |
| SD 3.5 Large | ~10 GB | | Either |

24GB: Flux runs perfectly at FP8 with room for ControlNet and LoRAs. SDXL runs at FP16 with massive headroom for batch generation.

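32GB: Flux at FP16 is listed at 33 GB, just over the card's capacity, so it only works if part of the pipeline is offloaded to system RAM. Better to use FP8 and allocate the extra VRAM to larger batch sizes or higher resolutions.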

Best Video Models for 24-32GB

| Model | FP16 | FP8 | Fits 24GB? | Fits 32GB? |
|---|---|---|---|---|
| LTX Video 2B | 14-22 GB | ~8-13 GB | Yes (FP16 at 768p) | Yes (FP16 at 720p) |
| AnimateDiff | 6-24 GB | | Yes (short clips) | Yes |
| Wan Video 1.3B | 9-13 GB | ~6 GB | Yes | Yes |
| CogVideoX 5B | 25-36 GB | ~16 GB | FP8 only | FP8 comfortably |
| HunyuanVideo | 47-58 GB | ~8 GB* | FP8+tiling | FP8+tiling |

*HunyuanVideo at FP8 with temporal tiling can fit on 24GB but generation is slow. The 32GB GPU gives more room for longer clips.

RTX 4090 vs RTX 5090 — Which to Buy?

| Spec | RTX 4090 | RTX 5090 |
|---|---|---|
| VRAM | 24 GB GDDR6X | 32 GB GDDR7 |
| Bandwidth | 1,008 GB/s | ~1,792 GB/s |
| Price | ~$1,600 | ~$2,000 |
| Best 30B quant | Q4_K_M | Q5_K_M-Q6_K |
| Flux precision | FP8 | FP8 (headroom) |
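
The bandwidth row matters because single-stream token generation is often limited by how fast the weights can be read from VRAM. The sketch below turns that into a rough ceiling. It assumes each token requires reading the weights once and ignores compute, KV cache reads, and framework overhead, so measured speeds will be lower. The function name is illustrative.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on tokens/s if each token reads weights_read_gb once."""
    return bandwidth_gb_s / weights_read_gb


# Memory bandwidth figures from the spec table above.
BANDWIDTH_GB_S = {"RTX 4090": 1008, "RTX 5090": 1792}

ratio = BANDWIDTH_GB_S["RTX 5090"] / BANDWIDTH_GB_S["RTX 4090"]
print(f"5090 / 4090 bandwidth ratio: {ratio:.2f}x")

# Example: a dense ~32B model at Q4, about 18 GB of weights read per token.
for gpu, bw in BANDWIDTH_GB_S.items():
    print(f"{gpu}: at most ~{bandwidth_ceiling_tok_s(bw, 18):.0f} tok/s")
```

The same reasoning suggests why MoE models generate faster than dense models of similar total size: only the active experts are read per token, so the per-token read is a fraction of the full footprint.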

Buy the 4090 if: Budget is a concern and Q4 quality is acceptable. It handles 95% of local AI use cases.

Buy the 5090 if: You want Q5-Q6 on 30B models (noticeable quality bump), run larger video generation, or frequently push VRAM limits.

Quick Start

# Best overall (MoE, frontier quality)
ollama run gemma4:26b

# Best for coding
ollama run qwen3-coder:30b-a3b

# Best for reasoning
ollama run deepseek-r1:32b
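
To compare your own hardware against the speed estimates in this guide, you can query a running Ollama server over its local REST API and compute tokens per second from the `eval_count` and `eval_duration` fields it returns. This is a minimal sketch that assumes Ollama is listening on its default port 11434 and that the model has already been pulled. The helper names and the prompt are illustrative.

```python
import json
import urllib.request


def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / result["eval_duration"] * 1e9


def generate(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main() -> None:
    # Requires a running Ollama server; the model tag matches the command above.
    result = generate("gemma4:26b", "Explain mixture-of-experts in two sentences.")
    print(result["response"])
    print(f"{tokens_per_second(result):.1f} tok/s")
```

Call `main()` with the server running. The first request includes model load time, so a second run is a fairer measure of steady-state speed.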


Frequently Asked Questions

What AI models can I run on RTX 4090?

The RTX 4090 (24GB) runs models up to ~30B parameters at Q4: Qwen 3 32B, Gemma 4 31B, DeepSeek R1 32B. MoE models are the sweet spot — Gemma 4 26B-A4B fits at Q4 with frontier-class quality. For images, Flux.1 Dev runs at FP8 (~13GB). For video, LTX Video 2B runs at FP16.

What AI models can I run on RTX 5090?

The RTX 5090 (32GB) unlocks 32B models at Q5-Q6 quality and MoE models like Qwen 3.5 35B-A3B at Q4. It also runs Flux at higher quality settings and larger video generation workloads. The 8GB extra over the 4090 matters most for running models at higher quantization quality.

Is RTX 5090 worth it over RTX 4090 for AI?

The RTX 5090 (32GB) adds 8GB VRAM and roughly 78% more memory bandwidth than the RTX 4090 (~1,792 GB/s vs 1,008 GB/s). For LLMs, this means Q5-Q6 instead of Q4 on 30B models (noticeable quality improvement) and the ability to run some 70B MoE models. For image/video generation, the extra VRAM allows higher resolutions. Worth it if you push the limits of 24GB regularly.

Can I run 70B models on RTX 4090?

Not at full quality. A 70B dense model needs ~40GB at Q4 — way over 24GB. However, MoE models that activate fewer parameters (like Gemma 4 26B-A4B with only 3.8B active) give you quality competitive with much larger models while fitting in 24GB. For true 70B inference, consider Mac M4 Max 64GB or dual GPUs.

Best quantization for 24GB GPU?

For 30B models: Q4_K_M leaves ~5GB for context. For 14B models: Q6_K or Q8 for maximum quality with plenty of headroom. For 8B models: Q8 or even FP16 — no reason to quantize aggressively when you have the memory. The sweet spot is running the largest model that fits at Q4-Q5.
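
To get a feel for what ~5GB of context headroom buys, the KV cache can be estimated from a model's layer count, number of key/value heads, and head dimension. The sketch below uses a hypothetical 30B-class architecture (64 layers, 8 KV heads, head dimension 128, 16-bit cache); these values are assumptions for illustration and are not taken from any specific model card.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size in GB (1 GB = 1e9 bytes) for a given context length."""
    return tokens * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value) / 1e9


def max_context_tokens(free_gb: float, layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Largest context whose KV cache fits in free_gb of spare VRAM."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return int(free_gb * 1e9 // per_token)


# Hypothetical 30B-class architecture, not taken from any specific model card.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

print(max_context_tokens(5.0, LAYERS, KV_HEADS, HEAD_DIM))
```

With these assumed values, 5 GB holds a cache of roughly 19,000 tokens. Models with more KV heads or layers get less, and runtimes that quantize the cache get more.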