What AI models can the RTX 5090 run?

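The RTX 5090 with 32GB VRAM can run 30B-class LLMs such as Qwen 3 30B and DeepSeek R1 Distill 32B at Q6, Llama 3 70B at roughly Q2-Q3 (a 70B model needs about 35GB at 4 bits per weight, so Q4 and above require partial CPU offloading), Flux Dev at FP8 with ControlNets, Wan Video 14B natively, and LTX Video 13B at FP8. The extra 8GB over the RTX 4090 significantly reduces the need for CPU offloading.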

Is the RTX 5090 worth upgrading from the RTX 4090?

If you work with video generation models like Wan 14B or need higher LLM quantization levels, yes. The jump from 24GB to 32GB eliminates offloading for many models. If you mainly run 8B-30B LLMs or SDXL, the RTX 4090 is still sufficient.

Can the RTX 5090 run DeepSeek R1?

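The full 671B DeepSeek R1 does not fit on a single RTX 5090: even at Q4 its weights come to roughly 335GB (671B parameters × 4 bits ÷ 8), more than ten times the 32GB available. The distilled variants are the practical option. DeepSeek R1 Distill 32B fits at Q4 through Q6 (roughly 18-27GB of weights), and the smaller distills (14B and below) run comfortably at Q8.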

Does the RTX 5090 support NVFP4 and NVFP8?

Yes. The RTX 5090 Blackwell architecture natively supports NVFP4 and NVFP8 quantization formats, enabling faster inference at lower precision with minimal quality loss compared to standard FP16.

March 26, 2026
Tags: rtx-5090, nvidia, gpu, hardware

RTX 5090 for AI — What Can You Run? Complete Guide

Complete guide to running AI models on the NVIDIA RTX 5090 with 32GB VRAM. Covers LLMs, image generation, video generation, NVFP4/NVFP8 support, and whether upgrading from the RTX 4090 is worth it.

The NVIDIA RTX 5090 is the new flagship consumer GPU for AI workloads. With 32GB of GDDR7 VRAM, approximately 1.8 TB/s memory bandwidth, and around 200 TFLOPS of FP16 compute, it sits at the top of what you can buy without stepping into professional-tier pricing. This guide covers exactly what it can run across LLMs, image generation, and video generation.

RTX 5090 Specs at a Glance

Spec	RTX 5090
VRAM	32 GB GDDR7
Memory Bandwidth	~1.8 TB/s
FP16 Compute	~200 TFLOPS
Architecture	Blackwell
New Formats	NVFP4, NVFP8
Price	~$1,999

The headline number is 32GB. That is 33% more VRAM than the RTX 4090, and the faster GDDR7 memory means models that fit will also run faster. But the real question is what those extra 8 gigabytes unlock.
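
The link between bandwidth and speed can be made concrete with a back-of-envelope bound. During single-stream decoding of a dense LLM, each generated token reads roughly all the weights once, so tokens per second cannot exceed memory bandwidth divided by model size. The sketch below assumes that simplification. The RTX 4090 figure of about 1.0 TB/s is an assumption added for comparison and does not come from the spec table, and the 20GB model is hypothetical.

```python
def decode_tokens_per_second_bound(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a memory-bound dense model.

    Assumes every token reads all weights once and ignores KV cache traffic,
    kernel overhead, and compute limits, so real throughput will be lower.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# ~1800 GB/s for the RTX 5090 comes from the spec table above.
# ~1000 GB/s for the RTX 4090 is an assumption added for comparison.
RTX_5090_GB_S = 1800.0
RTX_4090_GB_S = 1000.0

if __name__ == "__main__":
    model_gb = 20.0  # hypothetical quantized model that fits on both cards
    for name, bandwidth in (("RTX 5090", RTX_5090_GB_S), ("RTX 4090", RTX_4090_GB_S)):
        bound = decode_tokens_per_second_bound(bandwidth, model_gb)
        print(f"{name}: at most ~{bound:.0f} tokens/s for a {model_gb:.0f} GB model")
```

Treat the result as a ceiling rather than a prediction. Mixture-of-experts models read only their active experts per token, so this bound does not apply to them directly.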

LLMs on the RTX 5090

32GB of VRAM puts the RTX 5090 in a strong position for large language models, within limits set by simple arithmetic: weights take roughly parameters × bits per weight ÷ 8 bytes, and the KV cache needs room on top. The key improvements over 24GB are higher quantization levels and reduced offloading:

Llama 3 70B at Q2-Q3: At about 3 bits per weight the model comes to roughly 26GB and fits entirely in VRAM. Q4 needs about 35GB and Q6 over 50GB, so both require partial CPU offloading.
DeepSeek R1 Distill 32B at Q6: The distilled reasoning model fits at around 26GB. The full 671B DeepSeek R1 needs roughly 335GB at Q4 and is out of reach for a single consumer GPU.
Qwen 3 235B: Does not fit. Even at Q2 the MoE model's weights come to roughly 59GB, so most of it has to live in system RAM.
Qwen 3 30B at Q6: Runs with headroom for long context windows. Q8 comes to a little over 30GB of weights, so it is a tight fit at best.
Mixtral 8x22B: With about 141B total parameters it needs roughly 70GB at Q4, so it also depends on CPU offloading.

For most users, the practical benefit is running 30B-class models at Q6 instead of Q4, and 70B models at Q3 instead of Q2. That is a meaningful quality improvement: Q6 preserves noticeably more model capability than Q4, especially on reasoning and nuanced tasks.
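
Whether a quantized LLM fits can be sanity-checked with two pieces of arithmetic: weight size is roughly parameters × bits per weight ÷ 8, and the KV cache grows linearly with context length. The sketch below is a rough estimator under those assumptions, not a measurement. The 2GB reserve is a hypothetical allowance for runtime overhead, and the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) is my reading of the published model config rather than something stated in this guide.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB: one key and one value vector
    per layer, per KV head, per token."""
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, reserve_gb: float = 2.0) -> bool:
    """True if the weights plus a reserve fit in VRAM.

    reserve_gb is a hypothetical allowance for KV cache and runtime overhead.
    """
    return weight_size_gb(params_billion, bits_per_weight) + reserve_gb <= vram_gb


if __name__ == "__main__":
    for bits in (3, 4, 6):
        size = weight_size_gb(70, bits)
        print(f"70B at {bits} bits/weight: ~{size:.1f} GB, "
              f"fits in 32 GB: {fits_in_vram(70, bits, 32.0)}")
    # Assumed Llama 3 70B shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache.
    print(f"KV cache at 8192 tokens: ~{kv_cache_gb(80, 8, 128, 8192):.2f} GB")
```

Nominal bit counts understate real file sizes somewhat, because block-quantized formats also store scales. Plugging in a slightly higher bits-per-weight value gives a closer estimate.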

Image Generation on the RTX 5090

32GB is where image generation stops being constrained:

Flux Dev at FP8 — Uses about 17GB, leaving 15GB for ControlNets, IP-Adapter, and multiple LoRAs loaded simultaneously. No more choosing between quality and flexibility.
Flux Dev at FP16 — At 33GB, the full-precision model does not quite fit. FP8 remains the practical choice, but the quality difference is minimal.
SDXL 1.0 and SD 3.5 — Both fit at full FP16 precision with room to spare. Complex multi-ControlNet workflows run without any VRAM pressure.
Qwen Image at FP8 — The 20B vision-language image model fits at around 22GB. This was impractical on the 4090 without aggressive offloading.

The real advantage for image workflows is not fitting the base model — the 4090 could handle most base models. It is fitting the base model plus all the auxiliary components (ControlNets, LoRAs, IP-Adapter, upscalers) simultaneously without reloading.
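
A simple VRAM budget makes the point about auxiliary components easier to see. The sketch below adds up a workflow and reports what is left on each card. Only the 17GB Flux Dev FP8 figure comes from this guide. Every other size is a hypothetical placeholder, so substitute measurements from your own setup.

```python
from typing import Mapping


def total_gb(components: Mapping[str, float]) -> float:
    """Sum the VRAM used by every component held in memory at once."""
    return sum(components.values())


def remaining_vram_gb(vram_gb: float, components: Mapping[str, float]) -> float:
    """VRAM left after loading everything; negative means something must be offloaded."""
    return vram_gb - total_gb(components)


# Only the Flux Dev figure is from the guide; the rest are hypothetical placeholders.
WORKFLOW = {
    "Flux Dev FP8 (figure from the guide)": 17.0,
    "ControlNet (hypothetical)": 3.0,
    "IP-Adapter (hypothetical)": 1.0,
    "LoRAs (hypothetical)": 1.0,
    "activations and latents (hypothetical)": 4.0,
}

if __name__ == "__main__":
    for card, vram in (("RTX 4090", 24.0), ("RTX 5090", 32.0)):
        left = remaining_vram_gb(vram, WORKFLOW)
        status = "fits" if left >= 0 else "needs offloading or reloading"
        print(f"{card}: {total_gb(WORKFLOW):.0f} GB needed, {left:+.0f} GB left ({status})")
```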

Video Generation on the RTX 5090

Video generation is where 32GB makes the biggest difference:

Wan Video 14B native — The full-size Wan model fits in VRAM without CPU offloading. On the 4090, this required aggressive offloading that slowed generation considerably.
LTX Video 13B at FP8 — Fits with headroom. High-quality video generation at reasonable speeds.
HunyuanVideo at FP8 — The large video model becomes practical without the offloading penalty.
FramePack, AnimateDiff, CogVideoX — All run with massive headroom.

Video generation models are the fastest-growing category in terms of VRAM requirements. The 32GB ceiling gives the RTX 5090 more future-proofing than the 4090 had when it launched.

RTX 5090 vs RTX 4090: What 8GB Unlocks

Model	RTX 4090 (24GB)	RTX 5090 (32GB)
Llama 3 70B	Q2 (tight)	Q2-Q3 (Q4 and above need offloading)
DeepSeek R1 Distill 32B	Q4 (tight)	Q6 (fits with headroom)
Flux Dev	FP8 (base only)	FP8 + ControlNets + LoRAs
Wan Video 14B	Requires offloading	Fits natively
LTX Video 13B	FP8 (tight)	FP8 (comfortable)
Qwen Image 20B	FP8 with offloading	FP8 native

The pattern is clear: 32GB moves many models from "fits with compromises" to "fits comfortably." For power users running multiple workflows daily, this translates to less VRAM management and faster iteration.
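
The difference between "fits with compromises" and "fits comfortably" can be expressed as a small classifier. The sketch below is illustrative only. The 75% cutoff is a hypothetical threshold chosen for the example, and the 28GB figure for Wan Video 14B is an estimate of 14B parameters at 2 bytes each. The 22GB Qwen Image figure comes from this guide.

```python
def classify_fit(required_gb: float, vram_gb: float,
                 comfortable_ratio: float = 0.75) -> str:
    """Label a workload by how much of the card it consumes.

    comfortable_ratio is a hypothetical cutoff, not a measured threshold.
    """
    if required_gb > vram_gb:
        return "needs offloading"
    if required_gb > vram_gb * comfortable_ratio:
        return "fits, tight"
    return "fits comfortably"


if __name__ == "__main__":
    workloads = {
        "Qwen Image at FP8 (22 GB, from the guide)": 22.0,
        "Wan Video 14B at 16-bit (28 GB, estimated)": 28.0,
    }
    for name, required in workloads.items():
        on_4090 = classify_fit(required, 24.0)
        on_5090 = classify_fit(required, 32.0)
        print(f"{name}: 24 GB -> {on_4090}; 32 GB -> {on_5090}")
```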

NVFP4 and NVFP8: Blackwell's Quantization Advantage

The RTX 5090 natively supports NVFP4 and NVFP8 quantization formats in hardware. These are not the same as standard INT4/INT8 quantization — the floating-point formats preserve more dynamic range:

NVFP8 delivers near-FP16 quality at half the memory footprint. For models like Flux, this means FP8 inference is essentially lossless.
NVFP4 enables extremely compact model representations. A 32B model at NVFP4 can fit in around 18-20GB with surprisingly good quality retention. A 70B model still needs roughly 40GB at NVFP4, so it does not fit in 32GB.
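
The dynamic-range argument can be illustrated with a toy comparison of 4-bit floating point against 4-bit integers. The sketch below reflects my understanding of the format rather than anything stated in this guide: 4-bit E2M1 values grouped in blocks of 16 that share one scale. It is simplified in that the scale is kept as a plain Python float, whereas real NVFP4 stores block scales in FP8 and adds a per-tensor scale. The function names are made up for this example.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed block size: values that share one scale


def quantize_block_fp4(block):
    """Quantize one block to FP4 with a shared scale and return the dequantized values."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / FP4_MAGNITUDES[-1]  # map the largest value onto 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        out.append(nearest * scale if x >= 0 else -nearest * scale)
    return out


def quantize_block_int4(block):
    """Symmetric INT4 with a shared scale: evenly spaced integer levels from -7 to 7."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / 7.0
    return [max(-7, min(7, round(x / scale))) * scale for x in block]


if __name__ == "__main__":
    # One outlier next to many small values, all in the same block.
    block = [6.0] + [0.5] * (BLOCK_SIZE - 1)
    fp4 = quantize_block_fp4(block)
    int4 = quantize_block_int4(block)
    print(f"small value 0.5 -> FP4: {fp4[1]:.3f}, INT4: {int4[1]:.3f}")
```

The trade goes both ways. INT4 has finer steps near the block maximum, while FP4 spends its levels near zero, which helps when a block mixes a few large values with many small ones.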

Software support is still catching up. llama.cpp and vLLM have partial NVFP8 support. Expect full ecosystem support through 2026.

Who Should Upgrade from the RTX 4090

Upgrade if:

You regularly run Wan Video 14B or other large video generation models and the offloading overhead costs you significant time
You want to run 30B-class LLMs at Q6 instead of Q4, or keep a 70B model at Q3 entirely in VRAM
You work with Flux plus multiple ControlNets and LoRAs simultaneously
You are buying new — the RTX 5090 is the obvious choice over a new 4090

Keep the 4090 if:

You mainly run models under 20GB (8B-30B LLMs, SDXL, smaller video models)
Your workflows already fit comfortably in 24GB
The price premium does not justify the incremental improvement for your use case

Getting Started

Use Check My Hardware to see exactly which models fit your RTX 5090 setup. Browse the model catalog filtered by 32GB VRAM to see every compatible model. For comparisons with other GPUs, use the comparison tool.