FLUX.2 Klein 9B VRAM Requirements — FP16, FP8, GGUF Hardware Guide (2026)
FLUX.2 Klein 9B needs ~29 GB at FP16 and ~15 GB at FP8. Full VRAM table, GPU recommendations, and comparison with the 4B sibling and FLUX.2 Dev.
FLUX.2 Klein 9B sits in the middle of Black Forest Labs' image generation lineup: more powerful than the 4B sibling, dramatically more accessible than the 32B FLUX.2 Dev. At 9B parameters with a T5-XXL + CLIP-L text encoder stack (4.82B combined), it needs ~29 GB at FP16, so a 24 GB GPU runs FP16 only with the T5 encoder offloaded (~19 GB on the GPU). At FP8 it fits on a 16 GB GPU (~15 GB), or drops to ~10 GB with text encoder offloading.
This guide covers the exact FLUX.2 Klein 9B VRAM requirements at every precision level, GPU recommendations by tier, and how it compares to the rest of the FLUX.2 family.
If you want the maximum FLUX.2 quality, see FLUX.2 Dev VRAM Requirements — the 32B model delivers significantly better detail but requires ~64 GB at full precision. Klein 9B is the practical quality-efficiency balance.
FLUX.2 Klein 9B Architecture
| Feature | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | 9B (VERIFIED — catalog entry) |
| Text encoder | T5-XXL + CLIP-L (4.82B combined, VERIFIED) |
| VAE | Standard Flux VAE (~0.08B) |
| Developer | Black Forest Labs |
| Released | 2026-01-15 (VERIFIED) |
| License | FLUX.2 non-commercial research license |
| Default steps | 20 |
| HuggingFace | black-forest-labs/FLUX.2-klein-9B |
| FP8 variant | black-forest-labs/FLUX.2-klein-9B-fp8 |
FLUX.2 Klein 9B uses the same T5-XXL + CLIP-L text encoder combination as FLUX.1 Dev — a well-tested dual-encoder approach. The 9B transformer is more efficient per parameter than FLUX.1 Dev's 12B, achieving strong quality at a smaller footprint. Distilled from FLUX.2 Dev, it inherits the newer training but in a much smaller form factor.
FLUX.2 Klein 9B VRAM Requirements
All numbers reflect total VRAM usage at the listed resolution: transformer + T5-XXL + CLIP-L + VAE + activations.
| Precision | VRAM (1024×1024) | VRAM (512×512) | Min GPU |
|---|---|---|---|
| FP16 full | ~27–29 GB | ~22–24 GB | RTX 5090 32GB; RTX 4090 24GB only at 512×512 (tight) or with T5 offload |
| FP8 (transformer) | ~14–16 GB | ~12–13 GB | RTX 4080 Super 16GB, RTX 4090 24GB |
| FP8 + T.E. offload | ~8–10 GB | ~7–9 GB | RTX 4070 Ti Super 16GB, RTX 4070 12GB |
| GGUF Q5 | ~12–14 GB | ~10–12 GB | RTX 4080 Super 16GB |
| GGUF Q4 | ~10–12 GB | ~9–11 GB | RTX 4070 Ti Super 16GB, RTX 4070 12GB |
Spec source: Parameter count (9B) and T5-XXL + CLIP-L text encoder size (4.82B combined) are VERIFIED from the diffusion catalog. VRAM estimates are derived from catalog data, component-level calculation (9B parameters × 2 bytes per parameter at FP16 = ~18 GB for the transformer alone), and community benchmarks. Treat them as reliable guidance with a ±1–2 GB margin. The discovery report notes "9B ~29GB / RTX 4090" for FP16, which this table corroborates.
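The component-level calculation mentioned above can be reproduced in a few lines. This is a minimal sketch, assuming weights dominate memory and that FP16 and FP8 store exactly 2 and 1 bytes per parameter; the function name `weight_gb` is illustrative and not part of any library.

```python
# Bytes stored per parameter for each precision (assumption: no extra overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


# Parameter counts taken from the architecture table in this guide.
for name, params in [("DiT transformer", 9.0), ("T5-XXL + CLIP-L", 4.82)]:
    for precision in ("fp16", "fp8"):
        print(f"{name:16s} {precision}: ~{weight_gb(params, precision):.1f} GB")
```

Activations and the VAE come on top of these weight figures, which is why the totals in the table are higher than the sum of the two lines printed per precision.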
VRAM Breakdown by Component
Understanding where VRAM goes helps you choose the right quantization strategy:
| Component | FP16 | FP8 | Notes |
|---|---|---|---|
| DiT transformer (9B) | ~18 GB | ~9 GB | Primary quantization target |
| T5-XXL encoder | ~9.6 GB | ~4.8 GB | Can be offloaded to CPU RAM |
| CLIP-L encoder | ~0.4 GB | ~0.2 GB | Small; usually kept on GPU |
| VAE | ~0.2 GB | ~0.2 GB | Always small |
| Activations (1024×1024) | ~1–2 GB | ~1–2 GB | Scales with resolution |
| Total | ~29 GB | ~15 GB | With T.E. on GPU |
| Total (T5 offloaded) | ~19 GB | ~10 GB | T5 in system RAM |
The T5-XXL text encoder (~9.6 GB at FP16) is the biggest opportunity for VRAM reduction after the transformer itself. Offloading T5 to CPU RAM is particularly effective because it is only needed during the conditioning pass — not during the main denoising loop.
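To see how much T5 offloading frees up, the component table can be summed with and without the encoder on the GPU. This is a rough sketch using the figures from the table; the helper `gpu_total_gb` is hypothetical, and because the table's totals are rounded, the sums can differ from them by about 1 GB, inside the ±1–2 GB margin stated earlier.

```python
# Per-component weight sizes in GB, copied from the component table.
COMPONENTS_GB = {
    "transformer": {"fp16": 18.0, "fp8": 9.0},
    "t5_xxl": {"fp16": 9.6, "fp8": 4.8},
    "clip_l": {"fp16": 0.4, "fp8": 0.2},
    "vae": {"fp16": 0.2, "fp8": 0.2},
}
ACTIVATIONS_GB = (1.0, 2.0)  # low/high range at 1024x1024


def gpu_total_gb(precision: str, offload_t5: bool = False) -> tuple:
    """Return the (low, high) GPU-resident total in GB."""
    weights = sum(
        sizes[precision]
        for name, sizes in COMPONENTS_GB.items()
        if not (offload_t5 and name == "t5_xxl")  # offloaded T5 lives in system RAM
    )
    return (weights + ACTIVATIONS_GB[0], weights + ACTIVATIONS_GB[1])


for precision in ("fp16", "fp8"):
    for offload in (False, True):
        low, high = gpu_total_gb(precision, offload)
        label = "T5 offloaded" if offload else "T5 on GPU"
        print(f"{precision} ({label}): ~{low:.1f}-{high:.1f} GB")
```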
GPU Recommendations
8–10 GB — Not sufficient without aggressive offloading
Klein 9B cannot run meaningfully on 8 GB cards even with maximum quantization. The leanest configurations in this guide (FP8 or GGUF Q4 with the T5 encoder offloaded) still peak at roughly 8–11 GB at 1024×1024, which leaves no room on an 8 GB card.
Recommended alternative: FLUX.2 Klein 4B needs ~7 GB at FP8 and ~13 GB at FP16, so it fits on 8–12 GB GPUs at FP8 and can generate at sub-10-second speed on modern consumer hardware.
12 GB — RTX 4070 12GB, RTX 4070 Super 12GB, RTX 3060 12GB
Klein 9B is accessible at 12 GB with the right configuration:
- GGUF Q4 + T5 CPU offload: ~9–11 GB peak VRAM. Works at 1024×1024 with some headroom.
- FP8 + T5 CPU offload: ~8–10 GB peak VRAM. Better quality than Q4.
System RAM requirement: at least 16 GB for T5 offload; 24+ GB recommended.
The RTX 4070 Super 12GB is preferred over the base RTX 4070 12GB at this tier: both use the same 192-bit memory bus, but the Super has more CUDA cores, which helps with Flux's large transformer. Expect ~10–20 seconds per image at 1024×1024 with FP8 + T5 offload.
Verdict: Functional but requires configuration effort. If 12 GB is your limit, the 4B sibling offers less friction with similar speed.
16 GB — RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 Super 16GB
The practical minimum for Klein 9B at FP8, with T5 offloading optional rather than required:
- RTX 4060 Ti 16GB: FP8 transformer + T5 offload — ~9 GB peak. Slower due to 128-bit memory bus.
- RTX 4070 Ti Super 16GB: FP8 full pipeline ~14–16 GB with T5 on GPU. Strong generation speed.
- RTX 4080 Super 16GB: FP8 fits fully with minimal headroom at 1024×1024. ~5–8 second generation. Best 16 GB option for Klein 9B.
Verdict: 16 GB is the comfortable minimum. The RTX 4080 Super 16GB is the recommended 16 GB card — wide 256-bit bus makes Flux disproportionately faster than the 4060 Ti 16GB.
24 GB — RTX 4090, RTX 3090
Klein 9B's natural home:
- RTX 4090 24GB: FP8 with full T5+CLIP-L on GPU (~14–16 GB peak). Full 1024×1024 at 5–8 sec/image. FP16 is possible only with the T5 encoder offloaded to system RAM (~19 GB on the GPU); the full ~27–29 GB FP16 pipeline does not fit in 24 GB.
- RTX 3090 24GB: Same VRAM budget, slower raw compute. FP8 recommended.
At FP8, the full pipeline including text encoders fits on 24 GB with ~8–10 GB of headroom for higher resolutions, batch generation, or ControlNet addons.
Verdict: The RTX 4090 24GB runs Klein 9B at FP8 with no compromises. Fast and high-quality, but check the license before any commercial use: the model ships under the FLUX.2 non-commercial research license.
32 GB+ — RTX 5090, A6000 48GB
At 32+ GB, FLUX.2 Klein 9B runs at full FP16 with room for ControlNets, IP-Adapters, and LoRA stacks simultaneously.
- RTX 5090 32GB: FP16 full pipeline fits. Sub-5 second generation at 1024×1024.
- A6000 48GB: Full FP16 with large headroom. Batch generation practical.
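The tier recommendations above reduce to a first-fit lookup against the upper-bound figures in the VRAM table. The sketch below is illustrative only: `pick_config` is a hypothetical helper, the thresholds are the table's 1024×1024 upper bounds, and it ignores GGUF variants and any VRAM already used by the desktop or other processes.

```python
# (label, upper-bound VRAM in GB at 1024x1024), from the VRAM table,
# ordered from highest to lowest fidelity.
CONFIGS = [
    ("FP16 full", 29.0),
    ("FP8 (transformer)", 16.0),
    ("FP8 + T5 offload", 10.0),
]


def pick_config(vram_gb: float):
    """Return the first configuration whose upper bound fits, else None."""
    for label, needed_gb in CONFIGS:
        if needed_gb <= vram_gb:
            return label
    return None  # too small for Klein 9B; consider FLUX.2 Klein 4B


for card, vram in [("RTX 5090", 32), ("RTX 4090", 24), ("RTX 4070", 12), ("8 GB card", 8)]:
    print(f"{card}: {pick_config(vram) or 'Klein 9B not recommended'}")
```

A 16 GB card lands exactly on the FP8 threshold, which matches the "minimal headroom" note for the RTX 4080 Super; subtracting a safety margin from `vram_gb` before the lookup would push such cards to the offload configuration.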
FLUX.2 Klein 9B vs 4B — When to Choose Which
| Factor | Klein 4B | Klein 9B |
|---|---|---|
| VRAM (FP16) | ~13 GB | ~27–29 GB |
| VRAM (FP8) | ~7 GB | ~14–16 GB |
| Min GPU (FP8, no offload) | RTX 4070 12GB | RTX 4080 Super 16GB |
| Image quality | Good | Better detail, text, complex scenes |
| Generation speed (RTX 4090) | ~3–5 sec | ~5–10 sec |
| License | Apache 2.0 (commercial) | FLUX.2 non-commercial research |
| Best for | Budget setups, commercial use | Quality-focused, 24 GB cards |
If you have a 24 GB card and quality matters, choose Klein 9B. If you have 8–16 GB or need commercial rights, choose FLUX.2 Klein 4B.
FLUX.2 Klein 9B vs FLUX.1 Dev
| Aspect | FLUX.1 Dev | FLUX.2 Klein 9B |
|---|---|---|
| Parameters | 12B DiT | 9B DiT |
| Text encoders | T5-XXL + CLIP-L (same) | T5-XXL + CLIP-L (same) |
| VRAM (FP16 full) | ~33 GB | ~27–29 GB |
| VRAM (FP8 full) | ~15 GB | ~14–16 GB |
| Generation speed | Slower | Faster (fewer params) |
| Image quality | Strong | Comparable or better |
| License | FLUX.1 Dev non-commercial | FLUX.2 non-commercial |
Klein 9B is a meaningful upgrade from FLUX.1 Dev for the same VRAM tier — better quality, faster generation, newer training. For the previous generation VRAM numbers, see the image generation VRAM guide 2026.
Apple Silicon Macs
| Mac | Memory | Config | Est. speed |
|---|---|---|---|
| M4 Pro 24GB | 24 GB unified | FP8 (FP16 at ~29 GB does not fit) | ~15–25 sec/image |
| M4 Pro 48GB | 48 GB unified | FP16 comfortably | ~10–20 sec/image |
| M4 Max 36GB | 36 GB unified | FP16 with headroom | ~10–18 sec/image |
| M4 Max 48GB | 48 GB unified | FP16 + ControlNets | ~8–15 sec/image |
Apple Silicon is 2–3× slower per image than NVIDIA at equivalent precision, but unified memory lets a Mac with 36 GB or more run Klein 9B at FP16 without offloading, something no 24 GB NVIDIA card can do (FP16 requires ~29 GB). The same limit applies to the M4 Pro 24GB, which should run FP8.
Quick Setup with diffusers
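```python
import torch
# Pipeline class as given in this guide; check the model card for the class
# your diffusers version expects for FLUX.2 checkpoints.
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B",
    torch_dtype=torch.bfloat16,
)

# Store transformer weights in FP8 and upcast to bf16 layer by layer at compute
# time. A plain .to(torch.float8_e4m3fn) cast typically fails at inference
# because standard linear layers cannot run matmuls on FP8 weights.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

# 24 GB GPUs: keep the whole pipeline on the GPU.
pipe.to("cuda")
# 12–16 GB GPUs: replace pipe.to("cuda") with the line below so idle
# components (including T5) stay in system RAM. Do not call both.
# pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A photorealistic forest at golden hour, dappled light through ancient trees",
    num_inference_steps=20,
    guidance_scale=3.5,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")
```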
The official FP8 checkpoint (black-forest-labs/FLUX.2-klein-9B-fp8) is also available for loading without manual casting, which is useful for ComfyUI workflows.
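Because the figures in this guide carry a ±1–2 GB margin, it is worth measuring peak usage on your own card. The helper below is a small sketch using PyTorch's CUDA memory counters; `cuda_memory_gb` is a hypothetical name, and it returns `None` when PyTorch or a CUDA device is unavailable. Note that `max_memory_allocated` counts only memory allocated through PyTorch, so tools such as `nvidia-smi` may report a somewhat higher number.

```python
try:
    import torch
except ImportError:  # degrade gracefully when PyTorch is not installed
    torch = None


def cuda_memory_gb():
    """Return free, total, and peak-allocated GB for the current CUDA device."""
    if torch is None or not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return {
        "free": free_bytes / 1e9,
        "total": total_bytes / 1e9,
        "peak_allocated": torch.cuda.max_memory_allocated() / 1e9,
    }


# Typical use: call torch.cuda.reset_peak_memory_stats() before generating,
# run the pipeline once, then print the report.
print(cuda_memory_gb())
```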
Related Guides
- FLUX.2 Dev VRAM Requirements — the 32B flagship for 24–64 GB setups
- Image Generation VRAM Guide 2026 — full comparison table across all image models
- Best GPU for AI Image Generation — buyer guide for image generation hardware
- Diffusion Model Calculator — check FLUX.2 Klein 9B against your specific GPU instantly