RTX 5090 for AI — What Can You Run? Complete Guide
Complete guide to running AI models on the NVIDIA RTX 5090 with 32GB VRAM. Covers LLMs, image generation, video generation, NVFP4/NVFP8 support, and whether upgrading from the RTX 4090 is worth it.
The NVIDIA RTX 5090 is the new flagship consumer GPU for AI workloads. With 32GB of GDDR7 VRAM, approximately 1.8 TB/s memory bandwidth, and around 200 TFLOPS of FP16 compute, it sits at the top of what you can buy without stepping into professional-tier pricing. This guide covers exactly what it can run across LLMs, image generation, and video generation.
RTX 5090 Specs at a Glance
| Spec | RTX 5090 |
|---|---|
| VRAM | 32 GB GDDR7 |
| Memory Bandwidth | ~1.8 TB/s |
| FP16 Compute | ~200 TFLOPS |
| Architecture | Blackwell |
| New Formats | NVFP4, NVFP8 |
| Price | ~$1,999 |
The headline number is 32GB. That is 33% more VRAM than the RTX 4090, and the faster GDDR7 memory means models that fit will also run faster. But the real question is what those extra 8 gigabytes unlock.
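To make the "faster memory means faster models" point concrete, here is a minimal sketch of the usual bandwidth-bound estimate for token generation: each generated token has to read the model weights once, so bandwidth divided by weight size gives a rough ceiling. The function name, the 20GB example model, and the RTX 4090 bandwidth figure (about 1 TB/s, from its published spec) are assumptions for illustration. Real throughput lands below this ceiling, and MoE models read only their active experts per token.

```python
def decode_upper_bound_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Ceiling on decode speed when every weight is read once per token."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Bandwidth figures in GB/s. The 5090 value is the ~1.8 TB/s quoted above;
# the 4090 value is an assumption taken from its published spec (~1 TB/s).
GPUS = {"RTX 5090": 1800.0, "RTX 4090": 1008.0}

WEIGHTS_GB = 20.0  # hypothetical dense model that fits on both cards

for name, bandwidth in GPUS.items():
    bound = decode_upper_bound_tok_s(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most ~{bound:.0f} tokens/s for {WEIGHTS_GB:.0f} GB of weights")
```

Treat the result as an upper bound for comparing cards, not as a benchmark prediction.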
LLMs on the RTX 5090
32GB of VRAM puts the RTX 5090 in a strong position for large language models. The key improvements over 24GB are higher quantization levels and reduced offloading:
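- Llama 3 70B at Q3-Q4: At roughly 4.5 bits per weight, Q4 weights come to about 40GB, so Q4 still needs some layers offloaded to system RAM, though fewer than on the 4090. Quantizations around 3 bits per weight (roughly 26-30GB) fit entirely in VRAM, which was out of reach on 24GB.
- DeepSeek R1 distills at Q4: The full 671B-parameter R1 is far too large for a single consumer GPU, but the distilled variants fit. The 32B distill at Q4 needs roughly 20GB. Expect at least 15-20 tokens per second, depending on context length and runtime.
- Qwen 3 235B at Q2: This does not fit in VRAM. Even at about 2.5 bits per weight the weights exceed 70GB, so the massive MoE model only runs with most expert weights held in system RAM. Functional, but slow and with no headroom for large contexts.
- Qwen 3 30B at Q6: Runs with headroom at roughly 25GB, leaving room for long context windows. Q8 weights are about 32GB on their own and do not fit.
- Mixtral 8x7B at Q4: At roughly 26GB, this MoE model fits where it could not on 24GB. The larger Mixtral 8x22B (about 141B parameters) is around 80GB at Q4 and does not fit.
For most users, the practical benefit is running 30B-class models at Q6 instead of Q4, and running 70B models with less offloading. That is a meaningful quality improvement: Q6 preserves noticeably more model capability than Q4, especially on reasoning and nuanced tasks.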
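A quick way to sanity-check whether a quantized model fits is to multiply the parameter count by the effective bits per weight. The sketch below does exactly that. The bits-per-weight values are approximate assumptions for common GGUF-style quantization levels (they include block scales, so they sit slightly above the nominal bit width), and the 2GB overhead for KV cache and runtime buffers is a hypothetical placeholder that grows with context length.

```python
# Approximate effective bits per weight, including quantization scales.
# These are assumptions for illustration, not exact values for any file.
APPROX_BITS_PER_WEIGHT = {"Q4": 4.5, "Q6": 6.5, "Q8": 8.5, "FP16": 16.0}


def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights only: (params * bits / 8) bytes, expressed in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus a hypothetical fixed overhead fit in VRAM."""
    return weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for params in (8.0, 32.0, 70.0):
    for level in ("Q4", "Q6", "Q8"):
        bpw = APPROX_BITS_PER_WEIGHT[level]
        size = weight_footprint_gb(params, bpw)
        verdict = "fits" if fits_in_vram(params, bpw, vram_gb=32.0) else "needs offloading"
        print(f"{params:>4.0f}B at {level}: ~{size:5.1f} GB -> {verdict} on 32 GB")
```

The estimate ignores activation memory and the exact layout of any particular model file, so leave a margin before trusting a "fits" verdict.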
Image Generation on the RTX 5090
32GB removes most of the VRAM constraints on image generation:
- Flux Dev at FP8 — Uses about 17GB, leaving 15GB for ControlNets, IP-Adapter, and multiple LoRAs loaded simultaneously. No more choosing between quality and flexibility.
- Flux Dev at FP16: At about 33GB for the full pipeline (the 12B transformer plus its text encoders), the full-precision model does not quite fit. FP8 remains the practical choice, and the quality difference is minimal.
- SDXL 1.0 and SD 3.5 — Both fit at full FP16 precision with room to spare. Complex multi-ControlNet workflows run without any VRAM pressure.
- Qwen Image at FP8: The 20B image generation model fits at around 22GB. This was impractical on the 4090 without aggressive offloading.
The real advantage for image workflows is not fitting the base model — the 4090 could handle most base models. It is fitting the base model plus all the auxiliary components (ControlNets, LoRAs, IP-Adapter, upscalers) simultaneously without reloading.
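The "base model plus auxiliaries" argument is a simple budget sum, sketched below. Only the 17GB Flux Dev FP8 figure comes from this guide. Every other component size is a hypothetical placeholder, so substitute the sizes of the files you actually load.

```python
from typing import Dict, Tuple


def vram_budget(components_gb: Dict[str, float], vram_gb: float) -> Tuple[float, float]:
    """Return (total required GB, remaining headroom GB; negative means over budget)."""
    total = sum(components_gb.values())
    return total, vram_gb - total


workflow = {
    "Flux Dev FP8 (figure from this guide)": 17.0,
    "ControlNet A (hypothetical size)": 3.0,
    "ControlNet B (hypothetical size)": 3.0,
    "IP-Adapter (hypothetical size)": 1.0,
    "LoRAs (hypothetical size)": 0.5,
    "Activations and buffers (hypothetical)": 4.0,
}

for card, vram in (("RTX 4090", 24.0), ("RTX 5090", 32.0)):
    total, headroom = vram_budget(workflow, vram)
    status = "everything stays resident" if headroom >= 0 else "something must be unloaded"
    print(f"{card}: needs {total:.1f} GB, headroom {headroom:+.1f} GB -> {status}")
```

With these placeholder sizes the workflow overflows 24GB but stays resident in 32GB, which is the reloading penalty the paragraph above describes.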
Video Generation on the RTX 5090
Video generation is where 32GB makes the biggest difference:
- Wan Video 14B native — The full-size Wan model fits in VRAM without CPU offloading. On the 4090, this required aggressive offloading that slowed generation considerably.
- LTX Video 13B at FP8 — Fits with headroom. High-quality video generation at reasonable speeds.
- HunyuanVideo at FP8 — The large video model becomes practical without the offloading penalty.
- FramePack, AnimateDiff, CogVideoX — All run with massive headroom.
Video generation models are the fastest-growing category in terms of VRAM requirements. The 32GB ceiling gives the RTX 5090 more future-proofing than the 4090 had when it launched.
RTX 5090 vs RTX 4090: What 8GB Unlocks
| Model | RTX 4090 (24GB) | RTX 5090 (32GB) |
|---|---|---|
| Llama 3 70B | ~2-bit (tight) | ~3-bit native, Q4 with partial offloading |
| DeepSeek R1 32B distill | Q4 (tight) | Q4-Q6 (comfortable) |
| Flux Dev | FP8 (base only) | FP8 + ControlNets + LoRAs |
| Wan Video 14B | Requires offloading | Fits natively |
| LTX Video 13B | FP8 (tight) | FP8 (comfortable) |
| Qwen Image 20B | FP8 with offloading | FP8 native |
The pattern is clear: 32GB moves many models from "fits with compromises" to "fits comfortably." For power users running multiple workflows daily, this translates to less VRAM management and faster iteration.
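The "fits with compromises" versus "fits comfortably" distinction can be expressed as a small classifier. This is a sketch under stated assumptions: the 4GB comfort margin is a hypothetical threshold, and the footprints are the approximate figures quoted earlier in this guide. They cover model weights only, which is why a "tight" result can still mean offloading in practice once activations are counted.

```python
def classify_fit(required_gb: float, vram_gb: float, comfort_margin_gb: float = 4.0) -> str:
    """Label a model footprint against a VRAM size (margin is a hypothetical threshold)."""
    headroom = vram_gb - required_gb
    if headroom < 0:
        return "requires offloading"
    if headroom < comfort_margin_gb:
        return "tight"
    return "comfortable"


# Approximate footprints in GB, taken from the figures quoted in this guide.
models = {
    "Flux Dev FP8": 17.0,
    "Qwen Image FP8": 22.0,
    "Flux Dev FP16": 33.0,
}

for name, required in models.items():
    on_4090 = classify_fit(required, 24.0)
    on_5090 = classify_fit(required, 32.0)
    print(f"{name}: 24 GB -> {on_4090}; 32 GB -> {on_5090}")
```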
NVFP4 and NVFP8: Blackwell's Quantization Advantage
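The RTX 5090 natively supports 4-bit and 8-bit floating-point formats in hardware. NVFP4 is NVIDIA's block-scaled FP4 format and is new with Blackwell; "NVFP8" here refers to hardware FP8, which the RTX 4090 also supports. These are not the same as standard INT4/INT8 quantization: the floating-point formats preserve more dynamic range.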
- NVFP8 delivers near-FP16 quality at half the memory footprint. For models like Flux, this means FP8 inference is essentially lossless.
- NVFP4 enables extremely compact model representations. At about 4.5 bits per weight including block scales, a 32B model at NVFP4 fits in around 18-20GB with surprisingly good quality retention, while a 70B model needs roughly 40GB.
Software support is still catching up. llama.cpp and vLLM have partial NVFP8 support. Expect full ecosystem support through 2026.
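To illustrate why a block-scaled 4-bit float keeps more dynamic range than plain INT4, here is a simplified simulation of NVFP4-style quantization. It assumes the commonly described layout: 4-bit E2M1 elements in blocks of 16, each block sharing one 8-bit scale. The sketch keeps the scale as an ordinary Python float and omits the per-tensor scale, so it is an approximation of the idea and not NVIDIA's implementation. All function names are hypothetical.

```python
import math
from typing import List

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size


def bits_per_weight(element_bits: int = 4, scale_bits: int = 8,
                    block_size: int = BLOCK_SIZE) -> float:
    """Effective storage cost: element bits plus the shared scale amortized per block."""
    return element_bits + scale_bits / block_size


def quantize_block(block: List[float]) -> List[float]:
    """Quantize then dequantize one block (simplified: scale is not rounded to FP8)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest value onto 6.0
    result = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        result.append(math.copysign(nearest, x) * scale)
    return result


block = [12.0, 5.2, -0.9, 0.1] + [0.0] * 12
print("dequantized:", quantize_block(block)[:4])
print("bits per weight:", bits_per_weight())
print("32B model weights (GB):", 32.0 * bits_per_weight() / 8.0)
```

Because the scale is chosen per block of 16 values, a single outlier only coarsens its own block instead of the whole tensor.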
Who Should Upgrade from the RTX 4090
Upgrade if:
- You regularly run Wan Video 14B or other large video generation models and the offloading overhead costs you significant time
- You want to run 30B-class LLMs at Q6 instead of Q4, or 70B LLMs with less offloading, for better quality
- You work with Flux plus multiple ControlNets and LoRAs simultaneously
- You are buying new — the RTX 5090 is the obvious choice over a new 4090
Keep the 4090 if:
- You mainly run models under 20GB (8B-30B LLMs, SDXL, smaller video models)
- Your workflows already fit comfortably in 24GB
- The price premium does not justify the incremental improvement for your use case
Getting Started
Use Check My Hardware to see exactly which models fit your RTX 5090 setup. Browse the model catalog filtered by 32GB VRAM to see every compatible model. For comparisons with other GPUs, use the comparison tool.