What AI models can I run on RTX 4090?

The RTX 4090 (24GB) runs dense models up to ~32B parameters at Q4: Qwen 3 32B, Gemma 4 31B, DeepSeek R1 Distill 32B. MoE models are the sweet spot: Gemma 4 26B-A4B fits at Q4 with frontier-class quality. For images, Flux.1 Dev runs at FP8 (~13GB). For video, LTX Video 2B runs at FP16.

What AI models can I run on RTX 5090?

The RTX 5090 (32GB) unlocks 32B models at Q5-Q6 quality and MoE models like Qwen 3.5 35B-A3B at Q4. It also runs Flux at higher quality settings and larger video generation workloads. The 8GB extra over the 4090 matters most for running models at higher quantization quality.

Is RTX 5090 worth it over RTX 4090 for AI?

The RTX 5090 (32GB) adds 8GB of VRAM and ~78% more memory bandwidth than the RTX 4090 (~1,792 GB/s vs 1,008 GB/s). For LLMs, this means Q5-Q6 instead of Q4 on 30B models (a noticeable quality improvement) and more headroom for larger MoE models such as Qwen 3.5 35B-A3B. For image/video generation, the extra VRAM allows higher resolutions. Worth it if you push the limits of 24GB regularly.

Can I run 70B models on RTX 4090?

Not entirely in VRAM. A 70B dense model needs ~40GB at Q4, well over 24GB. However, MoE models that activate fewer parameters (like Gemma 4 26B-A4B with only 3.8B active) give you quality competitive with much larger models while fitting in 24GB. For true 70B inference, consider a Mac M4 Max with 64GB or dual GPUs.
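The ~40GB figure follows from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. A minimal sketch, assuming Q4-class quants average about 4.5 to 5 bits per weight once scales are included (an approximation, not a measured value), and ignoring KV cache and runtime overhead:

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight memory in GB (1 GB = 1e9 bytes), excluding KV cache and overhead."""
    return params_billions * bits_per_weight / 8


# Assumed range for Q4-class quantization; real files vary by quant type and architecture.
for bits in (4.5, 5.0):
    size = weight_gb(70, bits)
    verdict = "fits" if size <= 24 else "does not fit"
    print(f"70B at {bits} bits/weight: ~{size:.1f} GB, {verdict} in 24 GB")
```

Either end of that range lands far above 24GB, which is why a dense 70B model needs a second GPU or unified memory.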

Best quantization for 24GB GPU?

For 30B models: Q4_K_M leaves ~5GB for context. For 14B models: Q6_K or Q8 for maximum quality with plenty of headroom. For 8B models: Q8 or even FP16 — no reason to quantize aggressively when you have the memory. The sweet spot is running the largest model that fits at Q4-Q5.
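The same rule of thumb can be turned into a small picker that returns the highest-quality quant that fits. This is only a sketch: the bits-per-weight values are approximate assumptions for common GGUF quant types, and the 4GB reserve for context and runtime overhead is an illustrative default, not a measured requirement.

```python
# Approximate effective bits per weight, ordered best quality first (assumed values).
QUANTS = [
    ("FP16", 16.0),
    ("Q8_0", 8.5),
    ("Q6_K", 6.56),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.85),
]


def best_quant(params_billions, vram_gb, reserve_gb=4.0):
    """Return the highest-quality quant whose weights fit in vram_gb minus reserve_gb."""
    budget_gb = vram_gb - reserve_gb
    for name, bits in QUANTS:
        if params_billions * bits / 8 <= budget_gb:
            return name
    return None  # nothing fits, even at Q4


for params in (8, 14, 32, 70):
    print(f"{params}B on 24 GB: {best_quant(params, 24)}")
```

Under these assumptions the picker agrees with the guidance above: FP16 for 8B, Q8 for 14B, Q4_K_M for 32B, and nothing for 70B on a 24GB card.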

April 2, 2026
gpu, vram, hardware, rtx-4090, rtx-5090, 24gb, 32gb

Best AI Models for 24GB VRAM — RTX 4090 & RTX 5090 (LLMs, Image & Video)

RTX 4090 (24GB) runs 30B models, Flux at FP8, and video generation. RTX 5090 (32GB) adds Q5-Q6 quants and more headroom for larger MoE models. Complete guide with VRAM tables and speed estimates.

24GB and 32GB are the power tiers for local AI. The RTX 4090 (24GB) has been the gold standard for enthusiasts since 2022, and the RTX 5090 (32GB) extends that lead. These GPUs run frontier-class models, generate images with Flux, and handle video generation — all locally.

This guide ranks the best models for each GPU based on our 10-factor compatibility scoring.

Best LLMs for 24-32GB VRAM

Tier S: The Sweet Spot — MoE Models

MoE (Mixture of Experts) models are where 24-32GB GPUs truly shine. They load all expert weights into VRAM but only activate a fraction per token, giving frontier quality at fast speeds.

Model	Total Params	Active	VRAM (Q4)	Speed (4090)	Why
Gemma 4 26B-A4B	25.2B	3.8B	~15 GB	~70 tok/s	89% AIME, Apache 2.0
Qwen 3 30B-A3B	30B	3B	~17 GB	~70 tok/s	Best coding MoE
Qwen 3.5 35B-A3B	35B	3B	~20 GB	~65 tok/s	Updated Qwen MoE (5090 sweet spot)
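To put the MoE trade-off in numbers: VRAM tracks the total parameter count, while per-token work tracks the active count. A short sketch using the figures from the table above (the helper name is illustrative):

```python
# (model, total params in billions, active params in billions), from the table above
MOE_MODELS = [
    ("Gemma 4 26B-A4B", 25.2, 3.8),
    ("Qwen 3 30B-A3B", 30.0, 3.0),
    ("Qwen 3.5 35B-A3B", 35.0, 3.0),
]


def active_share(total_b: float, active_b: float) -> float:
    """Fraction of the model's weights used for each generated token."""
    return active_b / total_b


for name, total_b, active_b in MOE_MODELS:
    share = active_share(total_b, active_b)
    print(f"{name}: {share:.0%} of weights active per token")
```

Each model touches only about a tenth to a sixth of its weights per token, which is why generation speed looks closer to a 3-4B model than a 30B one.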

Recommended pick: Gemma 4 26B-A4B. It fits on both 24GB and 32GB at Q4, is multimodal and Apache 2.0 licensed, has the smallest VRAM footprint of the three, and ties Qwen 3 30B-A3B for the fastest inference (~70 tok/s).

ollama run gemma4:26b

Tier A: Dense 30B Models

Dense models use every parameter for every token. Slower than MoE at the same total size, but some tasks benefit from the uniform compute.

Model	Params	VRAM (Q4)	VRAM (Q6)	Best For
Gemma 4 31B	30.7B	~18 GB	~26 GB	Highest quality dense, 256K context
Qwen 3 32B	32B	~19 GB	~28 GB	Strong all-rounder
DeepSeek R1 Distill 32B	32B	~18 GB	~26 GB	Best reasoning at this size
Mistral Small 3 24B	24B	~14 GB	~20 GB	Fast, good coding

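24GB GPU: These fit at Q4 with limited context headroom (about 5GB left after an 18-19GB model). 32GB GPU: These run at Q5-Q6, a noticeable quality improvement. Q6 at ~26-28GB leaves only a few GB for context, so drop to Q5 when you need long prompts.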
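How far a few GB of headroom goes depends on the KV cache, which grows linearly with context length. A rough sketch, assuming a hypothetical 32B-class architecture with 64 layers, 8 KV heads, and a head dimension of 128 stored in FP16; check the model card for real values, since these are illustrative only:

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(free_gb, layers, kv_heads, head_dim, bytes_per_value=2):
    """Largest context that fits in free_gb of VRAM (1 GB = 1e9 bytes)."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return int(free_gb * 1e9 // per_token)


# Hypothetical config (assumption): 64 layers, 8 KV heads, head_dim 128, FP16 cache.
per_token = kv_cache_bytes_per_token(64, 8, 128)
print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"~{max_context_tokens(5, 64, 8, 128):,} tokens fit in 5 GB")
```

With those assumed dimensions, 5GB of headroom covers a context in the tens of thousands of tokens, well short of a 256K window. Runtime buffers reduce this further.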

Tier B: 14B at Maximum Quality

With 24GB, you can run 14B models at Q8, and with 32GB even at FP16, for effectively zero quality loss from quantization.

Model	Params	VRAM (Q8)	VRAM (FP16)	Notes
Qwen 3 14B	14B	~16 GB	~28 GB	Q8 on 24GB, FP16 on 32GB
Gemma 3 27B	27B	~29 GB	~54 GB	Q8 only on 32GB

Coding-Specific

Model	VRAM (Q4)	Notes
Qwen 3 Coder 30B-A3B	~17 GB	MoE, best coding efficiency
Qwen 3.5 27B	~16 GB	Strong coding dense model
DeepSeek Coder V2 236B	—	Too large for single GPU

Best Image Models for 24-32GB

Model	FP16	FP8	Best On
Flux.1 Dev	33 GB	13 GB	Both at FP8; FP16 exceeds 32GB without offloading
Flux.1 Schnell	33 GB	13 GB	Both — fastest Flux
Flux.2 Dev	~35 GB	~14 GB	Both at FP8
SDXL 1.0	8 GB	—	Either — tons of headroom
SD 3.5 Large	~10 GB	—	Either

24GB: Flux runs perfectly at FP8 with room for ControlNet and LoRAs. SDXL runs at FP16 with massive headroom for batch generation.

32GB: Flux at FP16 (~33GB) slightly exceeds VRAM, so it only runs with some components offloaded to system RAM. Better to use FP8 and allocate extra VRAM to larger batch sizes or higher resolutions.

Best Video Models for 24-32GB

Model	FP16	FP8	Fits 24GB?	Fits 32GB?
LTX Video 2B	14-22 GB	~8-13 GB	Yes (FP16 at 768p)	Yes (FP16 at 720p)
AnimateDiff	6-24 GB	—	Yes (short clips)	Yes
Wan Video 1.3B	9-13 GB	~6 GB	Yes	Yes
CogVideoX 5B	25-36 GB	~16 GB	FP8 only	FP8 comfortably
HunyuanVideo	47-58 GB	~8 GB*	FP8+tiling	FP8+tiling

*HunyuanVideo at FP8 with temporal tiling can fit on 24GB but generation is slow. The 32GB GPU gives more room for longer clips.

RTX 4090 vs RTX 5090 — Which to Buy?

Spec	RTX 4090	RTX 5090
VRAM	24 GB GDDR6X	32 GB GDDR7
Bandwidth	1,008 GB/s	~1,792 GB/s
Price	~$1,600	~$2,000
Best 30B quant	Q4_K_M	Q5_K_M-Q6_K
Flux precision	FP8	FP8 (headroom)

Buy the 4090 if: Budget is a concern and Q4 quality is acceptable. It handles most of the local AI use cases covered in this guide.

Buy the 5090 if: You want Q5-Q6 on 30B models (noticeable quality bump), run larger video generation, or frequently push VRAM limits.
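The bandwidth gap matters because token generation is usually limited by how fast weights can be read from VRAM. A back-of-the-envelope sketch, assuming decode speed is capped at bandwidth divided by the bytes read per token; this is a theoretical ceiling, and real throughput will be lower:

```python
# Memory bandwidth in GB/s, from the spec table above.
BANDWIDTH_GB_S = {"RTX 4090": 1008.0, "RTX 5090": 1792.0}


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on tokens/s if every token requires reading weights_read_gb from VRAM."""
    return bandwidth_gb_s / weights_read_gb


ratio = BANDWIDTH_GB_S["RTX 5090"] / BANDWIDTH_GB_S["RTX 4090"]
print(f"5090 bandwidth advantage: {ratio - 1:.0%}")

# Dense 32B model at Q4, using the ~19 GB figure from the dense-model table.
for gpu, bandwidth in BANDWIDTH_GB_S.items():
    print(f"{gpu}: at most ~{decode_ceiling_tok_s(bandwidth, 19):.0f} tok/s")
```

For a dense model that fits on both cards, the 5090's advantage shows up mainly as faster generation rather than as a larger model.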

Quick Start

# Best overall (MoE, frontier quality)
ollama run gemma4:26b

# Best for coding
ollama run qwen3-coder:30b-a3b

# Best for reasoning
ollama run deepseek-r1:32b
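
Once a model is pulled, it can also be called from a script through Ollama's local REST API. A minimal sketch using only the standard library, assuming Ollama is running on its default port (11434) and that the model tag matches the one used above; the helper names are illustrative:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text (requires the server to be running)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling ask("gemma4:26b", "Summarize MoE models in one sentence.") returns the completion as a string. The first call may be slow while the model loads into VRAM.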