How much VRAM do I need to run Flux locally?

Flux.1 Dev at FP16 needs about 33GB of VRAM. At FP8, it needs about 17GB. GGUF quantized versions go lower still, to roughly 9GB at Q4. With sequential CPU offloading, the FP16 model runs in about 12GB of VRAM, but generation is slower.

Can I run Flux on an RTX 4090?

Yes. The RTX 4090 (24GB) runs Flux.1 Dev at FP8 precision with room to spare — about 17GB VRAM usage. You can also run GGUF Q4-Q6 versions comfortably. For FP16, you would need sequential offloading, which fits but is slower.

What is the difference between Flux Dev and Flux Schnell?

Flux.1 Dev generates images in 28 steps with the highest quality. Flux.1 Schnell is a distilled version that generates in just 4 steps — roughly 7x faster — with slightly lower quality. Schnell also has an Apache 2.0 license (commercial use allowed), while Dev is non-commercial only.

Can I use ControlNet with Flux?

Yes. Flux.1 Dev has ControlNets for canny edge, depth map, and a union multi-control model from InstantX. Each adds about 3.6GB to VRAM requirements. ControlNets are not currently available for Flux.1 Schnell.

How do I reduce Flux VRAM usage?

Three main approaches: (1) Use FP8 precision instead of FP16, cutting VRAM roughly in half. (2) Use GGUF quantized versions from city96 on HuggingFace. (3) Enable sequential CPU offloading in diffusers, which loads model components one at a time.

March 25, 2026
Tags: flux, image-generation, comfyui, tutorial, gpu

How to Run Flux Locally — Complete Hardware & Setup Guide

Step-by-step guide to running Flux.1 Dev and Schnell on local hardware. Hardware requirements, ComfyUI setup, diffusers code, GGUF quantization, ControlNet support, and performance optimization tips.

Flux.1 is the current quality leader for local image generation. Getting it running on your hardware requires understanding the available precision options, runtime choices, and optimization techniques. This guide walks through everything from hardware requirements to optimized workflows.

What Is Flux.1?

Flux.1 is a text-to-image model from Black Forest Labs, built by the original creators of Stable Diffusion. It uses a 12B parameter DiT (Diffusion Transformer) architecture with a T5-XXL text encoder (4.7B parameters) and a CLIP-L text encoder.

There are two main variants:

Variant	Steps	Speed	License	Quality
Flux.1 Dev	28	Baseline	Non-commercial	Best
Flux.1 Schnell	4	~7x faster	Apache 2.0	Very good

Both share the same architecture and VRAM footprint. They differ in step count (and therefore speed), in licensing, and by a small margin in output quality.

Hardware Requirements

Flux's VRAM needs vary significantly depending on precision and optimization:

Configuration	VRAM Required	Speed (RTX 4090)	Quality
FP16 (unquantized)	33 GB	12 sec/image	Best
FP8	17 GB	10 sec/image	Near-best
GGUF Q8_0	~15 GB (12.7 GB file)	~14 sec/image	Very good
GGUF Q6_K	~12 GB (9.9 GB file)	~15 sec/image	Good
GGUF Q4_K_S	~9 GB (6.8 GB file)	~18 sec/image	Acceptable
FP16 + sequential offload	12.5 GB	30+ sec/image	Best
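
The FP16 and FP8 figures follow roughly from the parameter counts. The sketch below is a back-of-the-envelope estimate, assuming weights dominate memory and counting only the 12B transformer and the 4.7B T5-XXL encoder. CLIP-L, the VAE, and activations are ignored, so real usage will be somewhat higher. The function names are illustrative.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Parameter counts come from this guide; everything else is an approximation.

TRANSFORMER_PARAMS = 12e9  # Flux.1 DiT
T5_XXL_PARAMS = 4.7e9      # T5-XXL text encoder

BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gb(params: float, precision: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def pipeline_weight_gb(precision: str) -> float:
    """Transformer plus T5-XXL, both stored at the same precision."""
    return weight_gb(TRANSFORMER_PARAMS, precision) + weight_gb(T5_XXL_PARAMS, precision)


if __name__ == "__main__":
    for precision in ("fp16", "fp8"):
        print(f"{precision}: ~{pipeline_weight_gb(precision):.1f} GB")
```

This gives about 33.4 GB at FP16 and 16.7 GB at FP8, in line with the 33 GB and 17 GB rows above.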

Recommended GPUs:

RTX 4090 / RTX 5090 (24-32GB): Run FP8 natively with room for ControlNets. The ideal Flux experience.
RTX 4070 Ti Super / RTX 4080 (16GB): GGUF Q6-Q8 for good quality. FP8 is too tight with overhead.
RTX 4070 (12GB): GGUF Q4-Q5 or sequential offloading. Workable but slower.
RTX 4060 (8GB): GGUF Q4 only, very tight. Consider SDXL instead.
Apple Silicon (32GB+ unified): FP16 with MPS backend. Works well on M4 Max and M4 Ultra.

Method 1: Running Flux with ComfyUI

ComfyUI is the recommended runtime for Flux. It offers node-based workflows with fine-grained control over every aspect of generation.

Installation

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Install dependencies
pip install -r requirements.txt

# Start ComfyUI
python main.py

Downloading the Model

Place model files in the ComfyUI directory structure:

ComfyUI/
  models/
    diffusion_models/   # Flux transformer (GGUF or safetensors)
    clip/               # T5-XXL and CLIP-L text encoders
    vae/                # Flux VAE

For GGUF quantized versions (recommended for most users):

# Download GGUF Q4 (~6.8GB) from city96
huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Download text encoders
huggingface-cli download comfyanonymous/flux_text_encoders \
  t5xxl_fp8_e4m3fn.safetensors clip_l.safetensors \
  --local-dir ComfyUI/models/clip/

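# Download VAE (FLUX.1-dev is a gated repo: accept the license on HuggingFace
# and run `huggingface-cli login` first)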
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors \
  --local-dir ComfyUI/models/vae/
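
Before starting ComfyUI, it can help to confirm that every file landed in the right folder. The following is a small sketch, assuming the directory layout and file names used in the commands above; `missing_model_files` is an illustrative helper, not part of ComfyUI.

```python
from pathlib import Path

# Expected files, relative to the ComfyUI root, matching the download commands above.
EXPECTED_FILES = [
    "models/diffusion_models/flux1-dev-Q4_K_S.gguf",
    "models/clip/t5xxl_fp8_e4m3fn.safetensors",
    "models/clip/clip_l.safetensors",
    "models/vae/ae.safetensors",
]


def missing_model_files(comfy_root: str) -> list[str]:
    """Return the expected files that are not present under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All Flux model files are in place.")
```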

Workflow Setup

ComfyUI ships with built-in Flux workflows. Load the default Flux workflow from the workflow gallery, then configure:

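Set the UNETLoader node (shown as "Load Diffusion Model") to your safetensors model file; for GGUF files, use the Unet Loader (GGUF) node instead
Set the DualCLIPLoader node to your T5-XXL and CLIP-L files, with its type set to flux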
Set the VAELoader to the Flux VAE
Adjust resolution (1024x1024 recommended) and steps (28 for Dev, 4 for Schnell)

Method 2: Running Flux with Diffusers (Python)

For programmatic use, HuggingFace diffusers provides a clean Python API.

FP16 (Full Precision — 33GB VRAM)

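import torch
from diffusers import FluxPipeline

# bfloat16 is the 16-bit dtype used in the diffusers Flux examples and has the
# same memory footprint as float16. float16 also runs, but its narrower range
# can change outputs.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")  # the whole pipeline must fit in VRAM

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]
image.save("output.png")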

FP8 (Half the VRAM — 17GB)

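import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
# Store the transformer weights in FP8 and upcast them layer by layer for
# compute. A plain .to(torch.float8_e4m3fn) fails at inference because most
# ops have no FP8 kernels. Needs a diffusers release with layerwise casting.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")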

Sequential Offloading (12GB VRAM)

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16
)
# Do not call pipe.to("cuda") here: offloading handles device placement.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")

Sequential offloading moves model components between CPU RAM and GPU VRAM during inference. It fits on 12GB but is significantly slower — expect 30+ seconds per image on an RTX 4090 versus 10-12 seconds without offloading.
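
Because offloading costs speed, it makes sense to enable it only when the model does not fit. Here is a minimal sketch of that decision, assuming the 33GB FP16 figure from this guide as the threshold; `place_pipeline` is an illustrative helper, and the threshold is not a measured value.

```python
def place_pipeline(pipe, vram_gb: float, full_gpu_gb: float = 33.0) -> str:
    """Put the pipeline fully on the GPU if it fits, otherwise stream it.

    `pipe` is any diffusers pipeline; only `.to()` and
    `.enable_sequential_cpu_offload()` are called on it.
    `full_gpu_gb` defaults to this guide's FP16 figure (an assumption).
    """
    if vram_gb >= full_gpu_gb:
        pipe.to("cuda")
        return "gpu"
    # Offloading manages device placement itself, so .to("cuda") is skipped.
    pipe.enable_sequential_cpu_offload()
    return "sequential_offload"
```

The available VRAM can be read with `torch.cuda.get_device_properties(0).total_memory / 1e9` and passed in as `vram_gb`.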

GGUF Quantized Versions

The city96 GGUF quantizations are the most practical way to run Flux on consumer GPUs. Available on HuggingFace:

Quant	File Size	VRAM	Quality
Q2_K	4.0 GB	~6 GB	Significant loss
Q3_K_S	5.2 GB	~7 GB	Noticeable loss
Q4_0	6.8 GB	~9 GB	Acceptable
Q4_K_S	6.8 GB	~9 GB	Acceptable
Q5_K_S	8.3 GB	~10 GB	Good
Q6_K	9.9 GB	~12 GB	Very good
Q8_0	12.7 GB	~15 GB	Excellent

Sources:

Dev: city96/FLUX.1-dev-gguf
Schnell: city96/FLUX.1-schnell-gguf

In ComfyUI, GGUF files load through the GGUF loader nodes from city96's ComfyUI-GGUF custom node pack, which is installed separately. Quality at Q4-Q5 is close to FP8 for most prompts. Below Q4, text rendering and fine details degrade noticeably.
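
Outside ComfyUI, diffusers can load the same GGUF files through `GGUFQuantizationConfig` (this needs the `gguf` package and a recent diffusers release). The sketch below picks a quant from the VRAM column of the table above and loads it. It assumes the `flux1-dev-<quant>.gguf` file naming used in the download command earlier and a 2GB safety margin. `pick_quant`, `gguf_url`, and `load_gguf_pipeline` are illustrative names, and Q4_0 is left out because the table lists it with the same figures as Q4_K_S.

```python
from typing import Optional

# VRAM column from the table above, best quality first. Approximate figures.
QUANT_VRAM_GB = [
    ("Q8_0", 15), ("Q6_K", 12), ("Q5_K_S", 10),
    ("Q4_K_S", 9), ("Q3_K_S", 7), ("Q2_K", 6),
]


def pick_quant(vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the best quant that fits with some headroom, or None.

    headroom_gb is an assumed margin for activations and other processes.
    """
    for name, needed_gb in QUANT_VRAM_GB:
        if needed_gb + headroom_gb <= vram_gb:
            return name
    return None


def gguf_url(quant: str) -> str:
    """Build the file URL, assuming the flux1-dev-<quant>.gguf naming."""
    return (
        "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/"
        f"flux1-dev-{quant}.gguf"
    )


def load_gguf_pipeline(quant: str):
    """Load Flux.1 Dev with a GGUF transformer (heavy imports kept local)."""
    import torch
    from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

    transformer = FluxTransformer2DModel.from_single_file(
        gguf_url(quant),
        quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
        torch_dtype=torch.bfloat16,
    )
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        transformer=transformer,
        torch_dtype=torch.bfloat16,
    )
    # Moves each component to the GPU only while it is in use.
    pipe.enable_model_cpu_offload()
    return pipe
```

With these assumptions, a 12GB card maps to Q5_K_S and a 16GB card to Q6_K, which matches the GPU recommendations earlier in the guide.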

ControlNet Support

Flux.1 Dev has three ControlNet models from InstantX, each adding approximately 3.6GB VRAM:

ControlNet	Purpose	HuggingFace
Canny Edge	Structural guidance from edge detection	InstantX/FLUX.1-dev-Controlnet-Canny
Depth Map	3D spatial control from depth estimation	InstantX/FLUX.1-dev-Controlnet-Depth
Union (Multi)	Combined canny, depth, pose, tile, blur	InstantX/FLUX.1-dev-Controlnet-Union

The Union model is the most versatile — it handles multiple control types in a single model, saving VRAM compared to loading separate ControlNets.

ControlNets are not currently supported for Flux.1 Schnell due to its distilled pipeline.
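
In diffusers, a Flux ControlNet is loaded as a separate model and passed to `FluxControlNetPipeline`. The following sketch uses the repository IDs from the table above. The conditioning scale of 0.6 is an illustrative starting value, `generate_with_controlnet` is a hypothetical helper name, and the weights are loaded unquantized, so this form needs more VRAM than the FP8 and GGUF setups.

```python
# Repository IDs as listed in the table above.
CONTROLNET_REPOS = {
    "canny": "InstantX/FLUX.1-dev-Controlnet-Canny",
    "depth": "InstantX/FLUX.1-dev-Controlnet-Depth",
    "union": "InstantX/FLUX.1-dev-Controlnet-Union",
}


def controlnet_repo(kind: str) -> str:
    """Look up the repository for a control type (case-insensitive)."""
    try:
        return CONTROLNET_REPOS[kind.lower()]
    except KeyError:
        raise ValueError(f"unknown ControlNet type: {kind!r}") from None


def generate_with_controlnet(kind, control_image_path, prompt, control_mode=None):
    """Generate one image guided by a control image (heavy imports kept local).

    The Union model expects a control_mode index selecting the control type;
    check its model card for the mapping. Single-purpose models leave it None.
    """
    import torch
    from diffusers import FluxControlNetModel, FluxControlNetPipeline
    from diffusers.utils import load_image

    controlnet = FluxControlNetModel.from_pretrained(
        controlnet_repo(kind), torch_dtype=torch.bfloat16
    )
    pipe = FluxControlNetPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",
        controlnet=controlnet,
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")
    return pipe(
        prompt,
        control_image=load_image(control_image_path),
        control_mode=control_mode,
        controlnet_conditioning_scale=0.6,  # illustrative strength, tune per image
        num_inference_steps=28,
        guidance_scale=3.5,
    ).images[0]
```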

VRAM budget with ControlNet:

FP8 + ControlNet: ~21GB (fits RTX 4090)
GGUF Q6 + ControlNet: ~16GB (very tight on a 16GB RTX 4080)
GGUF Q4 + ControlNet: ~13GB (fits RTX 4070 Ti Super)
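
These budgets are the base model figure plus 3.6GB per ControlNet. Below is a short sketch of that arithmetic, assuming the base VRAM figures from this guide and ignoring any other overhead; the names are illustrative.

```python
CONTROLNET_GB = 3.6  # per-ControlNet overhead quoted above

# Base VRAM figures from this guide, in GB.
BASE_GB = {"FP8": 17, "GGUF Q6": 12, "GGUF Q4": 9}


def budget_gb(config: str, controlnets: int = 1) -> float:
    """Base model VRAM plus the overhead of the loaded ControlNets."""
    return BASE_GB[config] + controlnets * CONTROLNET_GB


def fits(config: str, card_gb: float, controlnets: int = 1) -> bool:
    """True if the estimated budget is within the card's VRAM."""
    return budget_gb(config, controlnets) <= card_gb


if __name__ == "__main__":
    for config in BASE_GB:
        print(f"{config} + 1 ControlNet: ~{budget_gb(config):.1f} GB")
```

By the same arithmetic, FP8 with two separate ControlNets comes to about 24.2GB and no longer fits a 24GB card, which is one reason to prefer the Union model.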

Performance Tips

Use torch.compile for Faster Inference

On NVIDIA GPUs with PyTorch 2.0+, torch.compile can speed up inference by 20-30%:

pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True
)

The first generation will be slower due to compilation, but subsequent generations are faster. Worth it for batch generation workflows.
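
To see whether compilation pays off for a given workload, compare the first call against the following ones. This is a minimal timing sketch; `time_calls` is an illustrative helper that works with any zero-argument callable.

```python
import time


def time_calls(fn, runs: int = 3) -> list[float]:
    """Call fn() `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations
```

For example, `time_calls(lambda: pipe(prompt, num_inference_steps=28))` returns a list whose first entry includes the compilation time, while the later entries reflect steady-state speed.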

Use Schnell for Iteration, Dev for Finals

A productive workflow: use Flux.1 Schnell at 4 steps to quickly explore compositions and prompts, then switch to Flux.1 Dev at 28 steps for your final image. Schnell is roughly 7x faster, making it ideal for the creative exploration phase.
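
One way to keep the two stages consistent is a small table of presets. In the sketch below, the step counts and the Dev guidance scale come from this guide. The Schnell values `guidance_scale=0.0` and `max_sequence_length=256` follow the diffusers examples for Schnell and are not from this guide. The preset names are illustrative.

```python
PRESETS = {
    "draft": {
        "model": "black-forest-labs/FLUX.1-schnell",
        "num_inference_steps": 4,
        "guidance_scale": 0.0,       # assumption: value used in diffusers examples
        "max_sequence_length": 256,  # assumption: value used in diffusers examples
    },
    "final": {
        "model": "black-forest-labs/FLUX.1-dev",
        "num_inference_steps": 28,
        "guidance_scale": 3.5,
    },
}


def call_kwargs(stage: str) -> dict:
    """Return the keyword arguments to pass to the pipeline call for a stage."""
    preset = PRESETS[stage]
    return {key: value for key, value in preset.items() if key != "model"}
```

A pipeline loaded from `PRESETS["draft"]["model"]` can then be called as `pipe(prompt, **call_kwargs("draft"))`, and likewise for `"final"`.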

Resolution Matters

Flux works best at 1024x1024. You can generate at lower resolutions (768x768, 512x512) for faster iteration, but the model was trained primarily on 1024px images. Going above 1024x1024 can cause artifacts without specific high-resolution techniques.
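
For non-square images, a reasonable approach is to keep the total pixel count near 1024x1024 and change only the aspect ratio. The sketch below assumes dimensions should be multiples of 16, which is what the diffusers Flux pipeline appears to expect; `dims_for_aspect` is an illustrative helper.

```python
import math


def dims_for_aspect(
    aspect_w: int,
    aspect_h: int,
    target_pixels: int = 1024 * 1024,
    multiple: int = 16,
) -> tuple[int, int]:
    """Return (width, height) near target_pixels, snapped to `multiple`."""
    # Scale factor that turns the aspect ratio into roughly target_pixels.
    scale = math.sqrt(target_pixels / (aspect_w * aspect_h))
    width = max(multiple, round(aspect_w * scale / multiple) * multiple)
    height = max(multiple, round(aspect_h * scale / multiple) * multiple)
    return width, height


if __name__ == "__main__":
    print(dims_for_aspect(1, 1))   # (1024, 1024)
    print(dims_for_aspect(16, 9))  # (1360, 768)
```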

FP8 Is Usually Enough

The quality difference between FP16 and FP8 is minimal for most use cases. Unless you are doing professional work where subtle detail differences matter, FP8 saves you 16GB of VRAM with negligible quality loss. Start with FP8 and only move to FP16 if you notice issues.

Troubleshooting

Out of memory errors: Try GGUF quantization first (Q4-Q6). If still tight, enable sequential offloading. Reduce resolution to 768x768 as a last resort.

Slow generation: Check that your model is on GPU, not CPU. Verify CUDA is available (torch.cuda.is_available()). Close other GPU-consuming applications. Consider using Schnell instead of Dev.

Black or corrupted images: Usually a precision mismatch. Ensure your VAE and text encoders are loaded at compatible precision. FP16 VAE with FP8 transformer works; mixing FP32 and FP16 components can cause issues.

Text encoder loading fails: T5-XXL is large (9.5GB at FP16). Use the FP8 T5 encoder (t5xxl_fp8_e4m3fn.safetensors) to save memory. It has negligible quality impact.
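
Several of these checks can be run in one go. This is a small diagnostic sketch, assuming a single GPU at index 0; `gpu_report` is an illustrative helper that also works when PyTorch is not installed.

```python
def gpu_report() -> dict:
    """Collect basic facts relevant to the checks above."""
    report = {"torch_installed": False, "cuda_available": False}
    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # mem_get_info reports free and total memory of the current device in bytes.
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        report["device"] = torch.cuda.get_device_name(0)
        report["total_gb"] = round(total_bytes / 1e9, 1)
        report["free_gb"] = round(free_bytes / 1e9, 1)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `free_gb` is well below `total_gb` before loading anything, another application is holding VRAM.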

Summary

Flux.1 is accessible on a wider range of hardware than its 33GB headline number suggests. With FP8 precision and GGUF quantization, you can run it on GPUs starting at 12GB VRAM.

24GB+ VRAM: FP8 natively, ControlNets available, best experience
16GB VRAM: GGUF Q6-Q8, very good quality
12GB VRAM: GGUF Q4-Q5 or sequential offloading, acceptable quality
8GB VRAM: Consider SDXL instead — better experience on limited hardware
