How much VRAM does Flux 2 Dev need?

Flux 2 Dev at FP16 requires approximately 24GB VRAM. At FP8 precision, it drops to about 12GB. With sequential CPU offloading, you can fit it on GPUs with as little as 10GB VRAM, though generation will be slower.

What is Flux 2 Klein 4B?

Flux 2 Klein 4B is a smaller, lighter variant of the Flux 2 family with only 4 billion parameters. It needs around 10GB VRAM at FP16 and is released under the Apache 2.0 license, making it suitable for commercial use.

Is Flux 2 better than Flux 1?

Flux 2 delivers improved image quality, especially in text rendering, fine details, and prompt adherence. It shares the same DiT architecture base as Flux 1 but benefits from refined training. The VRAM footprint is comparable.

Can I run Flux 2 on an RTX 4090?

Yes. The RTX 4090 (24GB) can run Flux 2 Dev at FP16, although the ~24GB footprint leaves little headroom. At FP8 (~12GB) it has plenty of room for ControlNets and LoRAs. It is the ideal consumer GPU for Flux 2.

March 26, 2026 | flux-2, image-generation, hardware, gpu

How to Run Flux 2 Locally — Hardware Requirements & Setup Guide

Complete guide to running Flux 2 Dev and Flux 2 Klein 4B on local hardware. VRAM requirements, ComfyUI workflows, diffusers code, FP8 optimization, and comparison with Flux 1.

Flux 2 is the latest generation of image models from Black Forest Labs, building on the widely adopted Flux 1 architecture. If you have been running Flux 1 locally, upgrading to Flux 2 is straightforward. If you are new to local image generation, this guide covers everything you need to get started.

What Changed from Flux 1 to Flux 2

Flux 2 Dev shares the same DiT (Diffusion Transformer) architecture foundation as Flux 1, but Black Forest Labs refined the training process to deliver noticeable improvements:

Better text rendering — text in generated images is more consistent and legible
Improved fine detail — hair, fabric textures, and small objects are sharper
Stronger prompt adherence — complex multi-subject prompts produce more accurate results
Same inference pipeline — if your workflow ran Flux 1, it runs Flux 2 with minimal changes

The model weight structure is compatible with existing Flux tooling, so ComfyUI workflows and diffusers scripts need only a model path swap.

Hardware Requirements

Flux 2 Dev has a memory profile comparable to Flux 1 Dev at the same precision:

Configuration	VRAM Required	Speed (RTX 4090)	Quality
FP16 (half precision)	~24 GB	10 sec/image	Best
FP8	~12 GB	8 sec/image	Near-best
FP8 + sequential offload	~8 GB	25+ sec/image	Near-best
FP16 + sequential offload	~10 GB	30+ sec/image	Best

Flux 2 Klein 4B is the lightweight alternative:

Configuration	VRAM Required	Speed (RTX 4090)	Quality
FP16	~10 GB	5 sec/image	Good
FP8	~6 GB	4 sec/image	Good

Klein 4B is released under the Apache 2.0 license, making it the go-to option for commercial projects that need a permissive license without the VRAM demands of the full Dev model.
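
The halving from FP16 to FP8 in the tables above follows from bytes per parameter. As a rough sketch of that arithmetic for Klein 4B's 4 billion parameters (weights only; the function name and the decimal-GB convention are choices made for this illustration):

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in decimal GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


klein_params = 4e9  # Klein 4B: 4 billion parameters
for precision in ("fp16", "fp8"):
    print(precision, weight_memory_gb(klein_params, precision), "GB")
```

This gives 8 GB at FP16 and 4 GB at FP8 for the weights. The table figures (~10 GB and ~6 GB) are higher, presumably because they also cover the text encoders, the VAE, and working memory during generation.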

Recommended GPUs:

RTX 4090 / RTX 5090 (24-32GB): Run Flux 2 Dev at FP16 natively. Best experience.
RTX 4080 / RTX 4070 Ti Super (16GB): FP8 fits comfortably with room for ControlNets.
RTX 4060 Ti 16GB / RTX 3060 12GB: FP8 (tight margins on the 12GB card), or Klein 4B at FP16.
RTX 4060 8GB: Klein 4B at FP8 only. Full Dev model requires heavy offloading.
Apple Silicon (32GB+ unified): FP16 with MPS backend works well on M4 Max and above.

Running Flux 2 with ComfyUI

ComfyUI remains the recommended runtime for Flux 2. The workflow setup is nearly identical to Flux 1.

Quick Start

# If you already have ComfyUI installed, just update it
cd ComfyUI
git pull
pip install -r requirements.txt  # pick up any new or updated dependencies

# Start ComfyUI
python main.py

Downloading Model Files

Place the Flux 2 Dev model in your existing ComfyUI directory structure:

ComfyUI/
  models/
    diffusion_models/   # Flux 2 transformer weights
    clip/               # T5-XXL and CLIP-L text encoders (same as Flux 1)
    vae/                # Flux VAE (same as Flux 1)

The text encoders and VAE from Flux 1 are compatible with Flux 2. You only need to download the new transformer weights.
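
Before launching ComfyUI, it can help to confirm that each of the folders in the layout above actually holds a weight file. The following is a minimal sketch, not part of ComfyUI itself: the helper name and the list of weight-file extensions are assumptions made for this example.

```python
from pathlib import Path

# Subfolders named in the layout above, relative to ComfyUI/models/.
EXPECTED_SUBDIRS = ("diffusion_models", "clip", "vae")

# Assumed set of weight-file extensions; adjust to match your downloads.
WEIGHT_SUFFIXES = {".safetensors", ".sft", ".ckpt", ".pt"}


def missing_model_dirs(comfy_root: str) -> list[str]:
    """Return the expected subfolders that contain no weight file."""
    models = Path(comfy_root) / "models"
    missing = []
    for name in EXPECTED_SUBDIRS:
        folder = models / name
        # Placeholder files without a weight extension do not count.
        has_weights = folder.is_dir() and any(
            p.suffix.lower() in WEIGHT_SUFFIXES for p in folder.iterdir()
        )
        if not has_weights:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_model_dirs("ComfyUI"))
```

An empty list means all three folders contain at least one weight file; any names printed are folders that still need a download.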

Workflow Configuration

Load the Flux workflow from the ComfyUI gallery, then:

Point the UNETLoader node (shown as "Load Diffusion Model") to the Flux 2 Dev weights
Keep your existing T5-XXL and CLIP-L text encoder configuration
Keep your existing Flux VAE
Set resolution to 1024x1024 and steps to 28

For Klein 4B, use the same workflow but point to the Klein model file and reduce steps to 20.

Running Flux 2 with Diffusers (Python)

Flux 2 Dev at FP16

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic coastal village at sunset, boats in harbor",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]
image.save("flux2_output.png")

FP8 for Lower VRAM (~12GB)

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.float16
)
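# Store transformer weights in FP8 and upcast each layer to FP16 for compute.
# A plain .to(torch.float8_e4m3fn) cast fails at the first matrix multiply.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.float16,
)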
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic coastal village at sunset, boats in harbor",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]

Sequential Offloading for Tight VRAM

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    torch_dtype=torch.float16
)
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A photorealistic coastal village at sunset, boats in harbor",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]

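Sequential offloading keeps the weights in system RAM and moves them to the GPU piece by piece during inference. This fits GPUs with limited VRAM but adds significant latency. Note that the script does not call pipe.to("cuda"); the offload hooks handle device placement.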

VRAM Optimization Strategies

FP8 precision is the single most effective optimization. Storing weights in 8 bits instead of 16 roughly halves VRAM usage with minimal quality loss. For Flux 2 Dev, this brings the requirement from about 24GB down to about 12GB, which fits (with tight margins) on mainstream GPUs like the RTX 3060 12GB.

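Sequential CPU offloading trades speed for VRAM. Weights stay in system RAM and each submodule is copied to the GPU only while it executes, then released. In diffusers, enable_model_cpu_offload() is a coarser alternative that swaps whole components (text encoder, transformer, VAE); it is faster but needs enough VRAM for the largest component. Both are useful when FP8 alone is not enough.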
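
To make the offloading trade-off concrete, here is a toy model of peak VRAM under three placement strategies. The component and block sizes are made-up round numbers for illustration, not measurements of Flux 2, and the function names are hypothetical. Activations and other working memory are ignored.

```python
# Hypothetical sizes in GB, for illustration only (not measured Flux 2 values).
# Each component is listed as the sizes of its internal blocks.
COMPONENTS = {
    "text_encoder": [2.0, 2.0],
    "transformer": [3.0, 3.0, 3.0, 3.0],
    "vae": [0.5],
}


def peak_all_resident(components: dict) -> float:
    """Everything stays on the GPU: peak is the sum of all blocks."""
    return sum(sum(blocks) for blocks in components.values())


def peak_component_offload(components: dict) -> float:
    """One whole component on the GPU at a time: peak is the largest component."""
    return max(sum(blocks) for blocks in components.values())


def peak_block_offload(components: dict) -> float:
    """One block on the GPU at a time: peak is the largest single block."""
    return max(max(blocks) for blocks in components.values())


print("all resident:     ", peak_all_resident(COMPONENTS), "GB")
print("component offload:", peak_component_offload(COMPONENTS), "GB")
print("block offload:    ", peak_block_offload(COMPONENTS), "GB")
```

In diffusers terms, component-level swapping corresponds to enable_model_cpu_offload() and the finer-grained variant to enable_sequential_cpu_offload(). The finer the granularity, the lower the peak, but the more CPU-to-GPU transfers each generation pays for.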

Klein 4B for constrained hardware. If your GPU has 8-10GB VRAM, Klein 4B at FP8 (~6GB) gives you a genuine Flux 2 experience without extreme optimization. The quality gap versus Dev is noticeable but the model is still highly capable for most use cases.

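torch.compile for speed. On NVIDIA GPUs with PyTorch 2.0+, compiling the transformer can speed up repeated generations by roughly 20-30%. The first generation after compiling is slower because it includes the compilation itself: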

pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True
)
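
Because compilation happens on the first call, a fair speed comparison should time that call separately from the ones that follow. A minimal timing sketch (the helper name is made up for this example, and it works with any zero-argument callable):

```python
import time
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 3) -> tuple[float, float]:
    """Return (first_call_seconds, mean_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()  # first call: includes any one-time compilation or warm-up
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        later.append(time.perf_counter() - start)
    return first, sum(later) / len(later)


# Example with a stand-in workload; swap in a pipeline call such as
# lambda: pipe(prompt="...", num_inference_steps=28) to time real generations.
first, steady = time_calls(lambda: sum(range(100_000)))
print(f"first call: {first:.4f}s, later calls: {steady:.4f}s on average")
```

Running this once with and once without the compiled transformer shows whether the steady-state gain justifies the one-time compile cost for your batch sizes.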

Flux 2 Klein 4B — The Lightweight Option

Klein 4B deserves special attention. At just 4 billion parameters, it is roughly one-third the size of Flux 2 Dev, which translates directly to lower VRAM requirements and faster generation times.

Key advantages of Klein 4B:

Apache 2.0 license — free for commercial use, no restrictions
~10GB VRAM at FP16 — fits on most modern GPUs without optimization
~6GB at FP8 — runs on budget GPUs like the RTX 4060 8GB
Fast generation — under 5 seconds per image on an RTX 4090

The tradeoff is quality. Klein 4B produces good results for most general-purpose use cases, but falls short of Dev on complex scenes, photorealistic faces, and detailed text rendering.

Summary

Flux 2 is a meaningful upgrade over Flux 1 with improved quality across the board, and the hardware requirements are manageable for most local setups.

24GB+ VRAM: Flux 2 Dev at FP16, full quality, best experience
12-16GB VRAM: Flux 2 Dev at FP8, near-best quality
8-10GB VRAM: Klein 4B (FP16 on 10GB, FP8 on 8GB), or Dev with heavy offloading
Under 8GB: Klein 4B at FP8, or consider SDXL
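
These tiers amount to picking the highest-quality configuration whose VRAM requirement fits. A minimal sketch of that lookup, using the approximate requirements from the tables in this guide (the function name is illustrative, and the offloading variants are left out for brevity):

```python
from typing import Optional

# (configuration, approximate VRAM needed in GB), in order of preference.
# Figures are the approximate values from the tables in this guide.
CONFIGS = [
    ("Flux 2 Dev, FP16", 24),
    ("Flux 2 Dev, FP8", 12),
    ("Flux 2 Klein 4B, FP16", 10),
    ("Flux 2 Klein 4B, FP8", 6),
]


def recommend_config(vram_gb: float) -> Optional[str]:
    """Return the first configuration that fits, or None if none does."""
    for name, required_gb in CONFIGS:
        if vram_gb >= required_gb:
            return name
    return None


for vram in (24, 16, 10, 8, 4):
    print(f"{vram} GB -> {recommend_config(vram)}")
```

A result of None means none of the listed configurations fit without offloading, which is the case where a smaller model such as SDXL becomes the practical choice.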
