Will It Run AI

FLUX.2 Klein 9B VRAM Requirements — FP16, FP8, GGUF Hardware Guide (2026)

FLUX.2 Klein 9B needs ~29 GB at FP16 and ~15 GB at FP8. Full VRAM table, GPU recommendations, and comparison with the 4B sibling and FLUX.2 Dev.

FLUX.2 Klein 9B sits in the middle of Black Forest Labs' image generation lineup: more powerful than the 4B sibling, dramatically more accessible than the 32B FLUX.2 Dev. At 9B parameters with a T5-XXL + CLIP-L text encoder stack (4.82B combined), it needs ~29 GB at FP16, fits on a 24 GB GPU at FP8 (or at FP16 with the T5 encoder offloaded), and fits on a 12–16 GB GPU at FP8 with text encoder offloading.

This guide covers the exact FLUX.2 Klein 9B VRAM requirements at every precision level, GPU recommendations by tier, and how it compares to the rest of the FLUX.2 family.

If you want the maximum FLUX.2 quality, see FLUX.2 Dev VRAM Requirements — the 32B model delivers significantly better detail but requires ~64 GB at full precision. Klein 9B is the practical quality-efficiency balance.

FLUX.2 Klein 9B Architecture

| Feature | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | 9B (VERIFIED, catalog entry) |
| Text encoder | T5-XXL + CLIP-L (4.82B combined, VERIFIED) |
| VAE | Standard Flux VAE (~0.08B) |
| Developer | Black Forest Labs |
| Released | 2026-01-15 (VERIFIED) |
| License | FLUX.2 non-commercial research license |
| Default steps | 20 |
| HuggingFace | black-forest-labs/FLUX.2-klein-9B |
| FP8 variant | black-forest-labs/FLUX.2-klein-9B-fp8 |

FLUX.2 Klein 9B uses the same T5-XXL + CLIP-L text encoder combination as FLUX.1 Dev — a well-tested dual-encoder approach. The 9B transformer is more efficient per parameter than FLUX.1 Dev's 12B, achieving strong quality at a smaller footprint. Distilled from FLUX.2 Dev, it inherits the newer training but in a much smaller form factor.

FLUX.2 Klein 9B VRAM Requirements

All numbers reflect total VRAM usage at the listed resolution: transformer + T5-XXL + CLIP-L + VAE + activations.

| Precision | VRAM (1024×1024) | VRAM (512×512) | Min GPU |
|---|---|---|---|
| FP16 full | ~27–29 GB | ~22–24 GB | RTX 5090 32GB (RTX 4090 24GB only with T5 offload) |
| FP8 (transformer) | ~14–16 GB | ~12–13 GB | RTX 4080 Super 16GB, RTX 4090 24GB |
| FP8 + T.E. offload | ~8–10 GB | ~7–9 GB | RTX 4070 Ti Super 16GB, RTX 4070 12GB |
| GGUF Q5 | ~12–14 GB | ~10–12 GB | RTX 4080 Super 16GB |
| GGUF Q4 | ~10–12 GB | ~9–11 GB | RTX 4070 Ti Super 16GB, RTX 4070 12GB |

Spec source: Params (9B) and T5+CLIP-L text encoder size (4.82B combined) are VERIFIED from the diffusion catalog. VRAM estimates are derived from catalog data, component-level calculation (9B parameters × 2 bytes/param at FP16 = ~18 GB for the transformer alone), and community benchmarks. Treat them as guidance with a ±1–2 GB margin. The discovery report notes "9B ~29GB / RTX 4090" for FP16, which matches the FP16 total in this table.

VRAM Breakdown by Component

Understanding where VRAM goes helps you choose the right quantization strategy:

| Component | FP16 | FP8 | Notes |
|---|---|---|---|
| DiT transformer (9B) | ~18 GB | ~9 GB | Primary quantization target |
| T5-XXL encoder | ~9.6 GB | ~4.8 GB | Can be offloaded to CPU RAM |
| CLIP-L encoder | ~0.4 GB | ~0.2 GB | Small; usually kept on GPU |
| VAE | ~0.2 GB | ~0.2 GB | Always small |
| Activations (1024×1024) | ~1–2 GB | ~1–2 GB | Scales with resolution |
| Total | ~29 GB | ~15 GB | With T.E. on GPU |
| Total (T5 offloaded) | ~19 GB | ~10 GB | T5 in system RAM |

The T5-XXL text encoder (~9.6 GB at FP16) is the biggest opportunity for VRAM reduction after the transformer itself. Offloading T5 to CPU RAM is particularly effective because it is only needed during the conditioning pass — not during the main denoising loop.
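The component arithmetic behind the breakdown can be reproduced with a few lines of Python. This is a minimal sketch, not a measurement tool: the function names are illustrative, it takes the 4.82B combined text encoder figure as one block (so its totals differ slightly from the per-encoder rows in the table), and it treats offloading as moving both text encoders off the GPU.

```python
# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}

# Figures taken from the tables in this guide.
TRANSFORMER_B = 9.0           # DiT parameters, in billions
TEXT_ENCODERS_B = 4.82        # T5-XXL + CLIP-L combined, in billions
VAE_GB = 0.2                  # treated as a fixed cost
ACTIVATIONS_GB = (1.0, 2.0)   # low/high estimate at 1024x1024


def weights_gb(params_billions: float, precision: str) -> float:
    """Weight memory in GB: billions of params times bytes per param."""
    return params_billions * BYTES_PER_PARAM[precision]


def pipeline_gb(precision: str, offload_text_encoders: bool = False) -> tuple[float, float]:
    """Return a (low, high) GPU memory estimate in GB at 1024x1024."""
    total = weights_gb(TRANSFORMER_B, precision) + VAE_GB
    if not offload_text_encoders:
        total += weights_gb(TEXT_ENCODERS_B, precision)
    low, high = ACTIVATIONS_GB
    return (round(total + low, 1), round(total + high, 1))


if __name__ == "__main__":
    for precision in ("fp16", "fp8"):
        for offload in (False, True):
            print(precision, "offload" if offload else "on-gpu",
                  pipeline_gb(precision, offload))
```

The estimates land within the ±1–2 GB margin of the table values, which is about as much precision as a back-of-envelope calculation like this can offer.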


GPU Recommendations

8–10 GB — Not sufficient without aggressive offloading

Klein 9B cannot run meaningfully on 8 GB cards even with maximum quantization. The GGUF Q4 pipeline peaks at ~10–12 GB, and even with T5 offloaded to CPU RAM it needs ~9–11 GB, so the card is full before activations fit.

Recommended alternative: FLUX.2 Klein 4B needs ~7 GB at FP8 (~13 GB at FP16), so it fits on 8–12 GB GPUs at FP8 and can generate at sub-10-second speed on modern consumer hardware.

12 GB — RTX 4070 12GB, RTX 4070 Super 12GB, RTX 3060 12GB

Klein 9B is accessible at 12 GB with the right configuration:

  • GGUF Q4 + T5 CPU offload: ~9–11 GB peak VRAM. Works at 1024×1024 with some headroom.
  • FP8 + T5 CPU offload: ~8–10 GB peak VRAM. Better quality than Q4.

System RAM requirement: at least 16 GB for T5 offload; 24+ GB recommended.

The RTX 4070 Super 12GB is preferred over the base RTX 4070 12GB at this tier. Both cards share the same 192-bit memory bus, but the Super has more CUDA cores, which speeds up Flux's large transformer. Expect ~10–20 seconds per image at 1024×1024 with FP8 + T5 offload.

Verdict: Functional but requires configuration effort. If 12 GB is your limit, the 4B sibling offers less friction with similar speed.

16 GB — RTX 4060 Ti 16GB, RTX 4070 Ti Super 16GB, RTX 4080 Super 16GB

The practical minimum for Klein 9B without offloading tricks:

  • RTX 4060 Ti 16GB: FP8 transformer + T5 offload — ~9 GB peak. Slower due to 128-bit memory bus.
  • RTX 4070 Ti Super 16GB: FP8 full pipeline ~14–16 GB with T5 on GPU. Strong generation speed.
  • RTX 4080 Super 16GB: FP8 fits fully with minimal headroom at 1024×1024. ~5–8 second generation. Best 16 GB option for Klein 9B.

Verdict: 16 GB is the comfortable minimum. The RTX 4080 Super 16GB is the recommended 16 GB card — wide 256-bit bus makes Flux disproportionately faster than the 4060 Ti 16GB.
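To put the bus-width comparison in numbers, peak memory bandwidth is the bus width times the per-pin data rate, divided by 8 bits per byte. The sketch below is illustrative only: the bus widths for the 4060 Ti and 4080 Super come from this section, while the per-pin data rates and the 4070 Ti Super bus width are assumptions taken from published card specifications, not from this guide's catalog.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s from bus width and per-pin data rate."""
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, per-pin data rate in Gbps); data rates are assumed
# from public spec sheets.
CARDS = {
    "RTX 4060 Ti 16GB": (128, 18.0),
    "RTX 4070 Ti Super 16GB": (256, 21.0),
    "RTX 4080 Super 16GB": (256, 23.0),
}

if __name__ == "__main__":
    for name, (bus, rate) in CARDS.items():
        print(f"{name}: {memory_bandwidth_gbs(bus, rate):.0f} GB/s")
```

Under these assumptions the 4080 Super has roughly 2.5× the peak bandwidth of the 4060 Ti 16GB, which is consistent with the speed gap described above.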

24 GB — RTX 4090, RTX 3090

Klein 9B's natural home:

  • RTX 4090 24GB: FP8 with full T5+CLIP-L on GPU (~14–16 GB peak). Full 1024×1024 at 5–8 sec/image. FP16 is possible only with the T5 encoder offloaded to CPU RAM (~19 GB on the GPU); the full ~27–29 GB FP16 pipeline does not fit in 24 GB.
  • RTX 3090 24GB: Same VRAM budget, slower raw compute. FP8 recommended.

At FP8, the full pipeline including text encoders fits on 24 GB with ~8–10 GB of headroom for higher resolutions, batch generation, or ControlNet addons.

Verdict: The RTX 4090 24GB runs Klein 9B at FP8 with no compromises: fast and high-quality. Commercial use is restricted by the non-commercial research license, so check the license before any commercial deployment.

32 GB+ — RTX 5090, A6000 48GB

At 32+ GB, FLUX.2 Klein 9B runs at full FP16 with room for ControlNets, IP-Adapters, and LoRA stacks simultaneously.

  • RTX 5090 32GB: FP16 full pipeline fits. Sub-5 second generation at 1024×1024.
  • A6000 48GB: Full FP16 with large headroom. Batch generation practical.
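The tier recommendations above amount to a lookup against the upper-bound figures in the VRAM table. The following is a rough sketch of that lookup, with a hypothetical `pick_config` helper; it orders configurations by the quality ranking this guide implies (FP16, then FP8, then GGUF Q5, then Q4) and does not account for ControlNets, LoRAs, or other add-ons.

```python
# (configuration, peak VRAM in GB at 1024x1024, needs CPU offload)
# Peaks are the upper bounds from the VRAM table; order is best quality first.
CONFIGS = [
    ("FP16 full", 29, False),
    ("FP8 (transformer)", 16, False),
    ("FP8 + T.E. offload", 10, True),
    ("GGUF Q5", 14, False),
    ("GGUF Q4", 12, False),
]


def pick_config(vram_gb: float, allow_offload: bool = True, margin_gb: float = 0.0):
    """Return the first configuration whose peak (plus margin) fits, or None."""
    for name, peak_gb, needs_offload in CONFIGS:
        if needs_offload and not allow_offload:
            continue
        if peak_gb + margin_gb <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (8, 12, 16, 24, 32):
        print(f"{vram} GB -> {pick_config(vram)}")
```

Passing a positive `margin_gb` is a simple way to reserve headroom on cards where the table marks a fit as tight.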

FLUX.2 Klein 9B vs 4B — When to Choose Which

| Factor | Klein 4B | Klein 9B |
|---|---|---|
| VRAM (FP16) | ~13 GB | ~27–29 GB |
| VRAM (FP8) | ~7 GB | ~14–16 GB |
| Min GPU (FP8, no offload) | RTX 4070 12GB | RTX 4080 Super 16GB |
| Image quality | Good | Better detail, text, complex scenes |
| Generation speed (RTX 4090) | ~3–5 sec | ~5–10 sec |
| License | Apache 2.0 (commercial) | FLUX.2 non-commercial research |
| Best for | Budget setups, commercial use | Quality-focused, 24 GB cards |

If you have a 24 GB card and quality matters, choose Klein 9B. If you have 8–16 GB or need commercial rights, choose FLUX.2 Klein 4B.

FLUX.2 Klein 9B vs FLUX.1 Dev

| Aspect | FLUX.1 Dev | FLUX.2 Klein 9B |
|---|---|---|
| Parameters | 12B DiT | 9B DiT |
| Text encoders | T5-XXL + CLIP-L (same) | T5-XXL + CLIP-L (same) |
| VRAM (FP16 full) | ~33 GB | ~27–29 GB |
| VRAM (FP8 full) | ~15 GB | ~14–16 GB |
| Generation speed | Slower | Faster (fewer params) |
| Image quality | Strong | Comparable or better |
| License | FLUX.1 Dev non-commercial | FLUX.2 non-commercial |

Klein 9B is a meaningful upgrade from FLUX.1 Dev for the same VRAM tier — better quality, faster generation, newer training. For the previous generation VRAM numbers, see the image generation VRAM guide 2026.


Apple Silicon Macs

| Mac | Memory | Config | Est. speed |
|---|---|---|---|
| M4 Pro 24GB | 24 GB unified | FP8 (FP16 needs ~29 GB and does not fit) | ~15–25 sec/image |
| M4 Pro 48GB | 48 GB unified | FP16 comfortably | ~10–20 sec/image |
| M4 Max 36GB | 36 GB unified | FP16 with headroom | ~10–18 sec/image |
| M4 Max 48GB | 48 GB unified | FP16 + ControlNets | ~8–15 sec/image |

Apple Silicon is 2–3× slower per image than NVIDIA at equivalent precision, but unified memory lets a Mac with 36 GB or more run Klein 9B at FP16 without offloading, which no 24 GB NVIDIA card can do (FP16 requires ~29 GB). A 24 GB M4 Pro faces the same limit and should use FP8.


Quick Setup with diffusers

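import torch
from diffusers import FluxPipeline

# Set to True on 12–16 GB GPUs to keep idle components (including T5) in CPU RAM
LOW_VRAM = False

# Assumes the checkpoint loads through FluxPipeline; check the model card
# for the pipeline class it expects.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B",
    torch_dtype=torch.bfloat16,
)

# FP8 transformer: store weights in float8 and upcast each layer to bfloat16
# for compute (needs a recent diffusers release). A plain
# .to(torch.float8_e4m3fn) cast generally fails at inference because
# standard layers cannot compute in float8.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

if LOW_VRAM:
    # Moves each component to the GPU only while it runs.
    # Do not also call pipe.to("cuda") in this mode.
    pipe.enable_model_cpu_offload()
else:
    pipe.to("cuda")

image = pipe(
    prompt="A photorealistic forest at golden hour, dappled light through ancient trees",
    num_inference_steps=20,
    guidance_scale=3.5,
    height=1024,
    width=1024,
).images[0]
image.save("output.png")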

The official FP8 checkpoint is also available for loading without manual casting — useful for ComfyUI workflows.


Frequently Asked Questions

How much VRAM does FLUX.2 Klein 9B need?

FLUX.2 Klein 9B uses a 9B DiT transformer with the same T5-XXL + CLIP-L text encoders (4.82B combined) as FLUX.1 Dev. At full FP16, the total pipeline requires approximately 27–29 GB VRAM. FP8 quantization brings this down to roughly 14–16 GB, fitting an RTX 4090 24GB comfortably or an RTX 4080 Super 16GB with text encoder offloading. GGUF Q4 drops it to approximately 10–12 GB.

Can FLUX.2 Klein 9B run on an RTX 4090?

Yes, with FP8. The full FP16 pipeline requires ~27–29 GB, which exceeds the RTX 4090's 24 GB, so FP16 only fits with the T5 encoder offloaded to CPU RAM (~19 GB on the GPU). With an FP8 transformer the pipeline needs ~14–16 GB and fits on 24 GB without offloading, with comfortable margin. FP8 generation at 1024×1024 typically takes 5–10 seconds on a 4090.

What is the difference between FLUX.2 Klein 9B and FLUX.2 Klein 4B?

FLUX.2 Klein 9B (9B parameters) delivers higher image quality than the 4B sibling, particularly on fine details, text rendering, and complex prompts. The 4B variant (~13 GB FP16, ~7 GB FP8) is the accessible option for 12–16 GB GPUs. The 9B is the quality step up for users with 24 GB cards who want better output than the 4B without the massive VRAM cost of FLUX.2 Dev (32B, ~64 GB FP16). FLUX.2 Klein 4B is Apache 2.0 licensed; the 9B uses the FLUX.2 non-commercial research license.

What is the difference between FLUX.2 Klein 9B and FLUX.1 Dev?

FLUX.1 Dev is a 12B DiT with T5-XXL + CLIP-L (~33 GB FP16). FLUX.2 Klein 9B is a newer, more efficient 9B architecture that achieves comparable or better quality despite fewer parameters, thanks to improved training. Klein 9B needs ~27–29 GB at FP16 vs ~33 GB for FLUX.1 Dev — a modest VRAM advantage. The key difference is generation speed: Klein 9B produces sub-10-second images on an RTX 4090, significantly faster than FLUX.1 Dev.

Is FLUX.2 Klein 9B commercially usable?

No. FLUX.2 Klein 9B uses the FLUX.2 non-commercial research license, which restricts commercial use. The 4B sibling (FLUX.2 Klein 4B) is the commercially usable option — it is Apache 2.0 licensed. If you need commercial rights, use FLUX.2 Klein 4B or an Apache-licensed model. Always verify the current license on the HuggingFace model card before commercial deployment.

Does FLUX.2 Klein 9B work in ComfyUI?

Yes. FLUX.2 Klein 9B uses the standard Flux pipeline compatible with ComfyUI's existing FLUX nodes. Load it the same way as FLUX.1 Dev — ComfyUI does not distinguish between Klein and Dev at the node level. FP8 checkpoint variants are available directly on HuggingFace, and GGUF versions are available through community conversions on CivitAI.