Will It Run AI

How to Run Flux Locally — Complete Hardware & Setup Guide

Step-by-step guide to running Flux.1 Dev and Schnell on local hardware. Hardware requirements, ComfyUI setup, diffusers code, GGUF quantization, ControlNet support, and performance optimization tips.

Flux.1 is the current quality leader for local image generation. Getting it running on your hardware requires understanding the available precision options, runtime choices, and optimization techniques. This guide walks through everything from hardware requirements to optimized workflows.


What Is Flux.1?

Flux.1 is a text-to-image model from Black Forest Labs, built by the original creators of Stable Diffusion. It uses a 12B-parameter DiT (Diffusion Transformer) architecture with a T5-XXL text encoder (4.7B parameters) and a CLIP-L encoder.

There are two main variants:

Variant          Steps   Speed         License          Quality
Flux.1 Dev       28      Baseline      Non-commercial   Best
Flux.1 Schnell   4       ~7x faster    Apache 2.0       Very good

Both share the same architecture and VRAM footprint. They differ in step count (and therefore generation speed), licensing, and, to a lesser degree, output quality.


Hardware Requirements

Flux's VRAM needs vary significantly depending on precision and optimization:

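Configuration               VRAM Required   Speed (RTX 4090)   Quality
FP16 (unquantized)          33 GB           12 sec/image       Best
FP8                         17 GB           10 sec/image       Near-best
GGUF Q8                     12.7 GB         ~14 sec/image      Very good
GGUF Q6_K                   9.9 GB          ~15 sec/image      Good
GGUF Q4_K_S                 6.8 GB          ~18 sec/image      Acceptable
FP16 + sequential offload   12.5 GB         30+ sec/image      Best

Note: the GGUF rows here match the file sizes of the quantized transformer. The GGUF section later in this guide lists approximate VRAM in use, which is roughly 2 GB higher for each quant.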
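The FP16 and FP8 figures follow from simple arithmetic on the parameter counts given earlier. The sketch below is a rough estimate of weight memory only, assuming 2 bytes per parameter at FP16 and 1 byte at FP8. It ignores activations, the VAE, and CLIP-L, so treat the results as approximate. The function name is illustrative.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: decimal gigabytes (1 GB = 1e9 bytes). Activations, the VAE,
# and CLIP-L are not counted, so real usage will be somewhat higher.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}

def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB for a model at the given precision."""
    return params_billion * BYTES_PER_PARAM[precision]

transformer_fp16 = weights_gb(12.0, "fp16")  # Flux transformer, 12B parameters
t5_fp16 = weights_gb(4.7, "fp16")            # T5-XXL text encoder, 4.7B parameters
total_fp8 = weights_gb(12.0, "fp8") + weights_gb(4.7, "fp8")

print(f"Transformer FP16: {transformer_fp16:.1f} GB")
print(f"T5-XXL FP16:      {t5_fp16:.1f} GB")
print(f"Combined FP16:    {transformer_fp16 + t5_fp16:.1f} GB")
print(f"Combined FP8:     {total_fp8:.1f} GB")
```

The combined totals land close to the 33 GB and 17 GB rows in the table.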

Recommended GPUs:

  • RTX 4090 / RTX 5090 (24-32GB): Run FP8 natively with room for ControlNets. The ideal Flux experience.
  • RTX 4070 Ti Super / RTX 4080 (16GB): GGUF Q6-Q8 for good quality. FP8 is too tight with overhead.
  • RTX 4070 (12GB): GGUF Q4-Q5 or sequential offloading. Workable but slower.
  • RTX 4060 (8GB): GGUF Q4 only, very tight. Consider SDXL instead.
  • Apple Silicon (32GB+ unified): FP16 with MPS backend. Works well on M4 Max and M3 Ultra.

Method 1: Running Flux with ComfyUI

ComfyUI is the recommended runtime for Flux. It offers node-based workflows with fine-grained control over every aspect of generation.

Installation

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Install dependencies
pip install -r requirements.txt

# Start ComfyUI
python main.py

Downloading the Model

Place model files in the ComfyUI directory structure:

ComfyUI/
  models/
    diffusion_models/   # Flux transformer (GGUF or safetensors)
    clip/               # T5-XXL and CLIP-L text encoders
    vae/                # Flux VAE

For GGUF quantized versions (recommended for most users):

# Download GGUF Q4 (~6.8GB) from city96
huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/diffusion_models/

# Download text encoders
huggingface-cli download comfyanonymous/flux_text_encoders \
  t5xxl_fp8_e4m3fn.safetensors clip_l.safetensors \
  --local-dir ComfyUI/models/clip/

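# Download VAE (FLUX.1-dev is a gated repo: accept the license on
# HuggingFace and run `huggingface-cli login` first)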
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors \
  --local-dir ComfyUI/models/vae/
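Before launching ComfyUI, it can help to confirm the downloads landed where the loader nodes will look for them. This is a minimal sketch, assuming the file names from the download commands above and a ComfyUI checkout in the current directory. The helper name is hypothetical.

```python
from pathlib import Path

# Expected files, relative to the ComfyUI root, matching the download
# commands above. Adjust the names if you chose a different quantization.
EXPECTED_FILES = [
    "models/diffusion_models/flux1-dev-Q4_K_S.gguf",
    "models/clip/t5xxl_fp8_e4m3fn.safetensors",
    "models/clip/clip_l.safetensors",
    "models/vae/ae.safetensors",
]

def missing_model_files(comfy_root: str) -> list[str]:
    """Return the expected model files that are not present under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All Flux model files found.")
```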

Workflow Setup

ComfyUI ships with built-in Flux workflows. Load the default Flux workflow from the workflow gallery, then configure:

  1. Set the UNETLoader node to your safetensors model file, or use the GGUF loader node from the ComfyUI-GGUF custom nodes for a .gguf file
  2. Set the DualCLIPLoader node to your T5-XXL and CLIP-L files
  3. Set the VAELoader to the Flux VAE
  4. Adjust resolution (1024x1024 recommended) and steps (28 for Dev, 4 for Schnell)
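If you prefer to apply step 4 outside the UI, the settings can be patched into an exported workflow file. This is a sketch under assumptions: it expects a workflow exported in ComfyUI's API format, where each node is a dictionary entry with "class_type" and "inputs". The two-node fragment and the function name are hypothetical, and a real export contains many more nodes.

```python
import copy
import json

def apply_settings(workflow: dict, settings: dict) -> dict:
    """Return a copy of an API-format workflow with matching inputs overridden.

    Only inputs that already exist on a node are replaced. Linked inputs
    (lists such as ["12", 0]) are left untouched.
    """
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        inputs = node.get("inputs", {})
        for key, value in settings.items():
            if key in inputs and not isinstance(inputs[key], list):
                inputs[key] = value
    return patched

# Hypothetical fragment for illustration only.
example = {
    "5": {"class_type": "EmptySD3LatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    "17": {"class_type": "BasicScheduler",
           "inputs": {"steps": 20, "scheduler": "simple",
                      "denoise": 1.0, "model": ["12", 0]}},
}

# Recommended Dev settings from the list above.
dev_settings = {"width": 1024, "height": 1024, "steps": 28}
print(json.dumps(apply_settings(example, dev_settings), indent=2))
```

For Schnell, the same call with "steps": 4 applies.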

Method 2: Running Flux with Diffusers (Python)

For programmatic use, HuggingFace diffusers provides a clean Python API.

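FP16 (Unquantized, 33GB VRAM)

import torch
from diffusers import FluxPipeline

# FLUX.1-dev is a gated repo: accept the license on HuggingFace and log in first.
# The weights are published in bfloat16; torch.bfloat16 uses the same memory
# as float16 and is the dtype used in the diffusers examples.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16
)
pipe.to("cuda")  # the whole pipeline must fit in VRAM

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]
image.save("output.png")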

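FP8 (About Half the VRAM, 17GB)

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16
)
# Store the transformer weights in FP8 and upcast each layer for compute.
# A plain .to(torch.float8_e4m3fn) cast fails at inference because most
# PyTorch ops have no FP8 kernels. This call requires a recent diffusers
# release that includes layerwise casting.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")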

Sequential Offloading (12GB VRAM)

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16
)
# Do not call pipe.to("cuda") here: offloading manages device placement itself.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")

Sequential offloading moves model components between CPU RAM and GPU VRAM during inference. It fits on 12GB but is significantly slower — expect 30+ seconds per image on an RTX 4090 versus 10-12 seconds without offloading.
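diffusers also offers enable_model_cpu_offload(), which moves whole components (text encoders, transformer, VAE) rather than individual layers. It usually sits between the two extremes in speed and VRAM use, though this guide has no benchmark for it. The helper below is a hypothetical wrapper that makes the three placement choices explicit. It assumes `pipe` is a diffusers pipeline such as the FluxPipeline loaded above.

```python
def apply_offload(pipe, mode: str = "none"):
    """Place a diffusers pipeline according to the chosen offload mode.

    mode="none":       everything on the GPU (fastest, most VRAM)
    mode="model":      whole components swapped in and out of VRAM
    mode="sequential": layer-level swapping (slowest, least VRAM)
    """
    if mode == "none":
        pipe.to("cuda")
    elif mode == "model":
        pipe.enable_model_cpu_offload()
    elif mode == "sequential":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown offload mode: {mode!r}")
    return pipe
```

A call such as apply_offload(pipe, "sequential") replaces the direct enable_sequential_cpu_offload() call, which makes it easy to switch modes from a config value.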


GGUF Quantized Versions

The city96 GGUF quantizations are the most practical way to run Flux on consumer GPUs. Available on HuggingFace:

Quant     File Size   VRAM      Quality
Q2_K      4.0 GB      ~6 GB     Significant loss
Q3_K_S    5.2 GB      ~7 GB     Noticeable loss
Q4_0      6.8 GB      ~9 GB     Acceptable
Q4_K_S    6.8 GB      ~9 GB     Acceptable
Q5_K_S    8.3 GB      ~10 GB    Good
Q6_K      9.9 GB      ~12 GB    Very good
Q8_0      12.7 GB     ~15 GB    Excellent


GGUF quantization works through ComfyUI with the GGUF loader nodes. Quality at Q4-Q5 is surprisingly close to FP8 for most prompts. Below Q4, text rendering and fine details degrade noticeably.
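Choosing a quant from the table comes down to picking the largest one that fits with some headroom. The sketch below encodes the approximate VRAM column from the table above. The 1 GB default headroom and the function name are assumptions, not measured values.

```python
# (quant name, approximate VRAM in GB) from the table above, largest first.
GGUF_VRAM_GB = [
    ("Q8_0", 15.0),
    ("Q6_K", 12.0),
    ("Q5_K_S", 10.0),
    ("Q4_K_S", 9.0),
    ("Q4_0", 9.0),
    ("Q3_K_S", 7.0),
    ("Q2_K", 6.0),
]

def pick_quant(vram_gb: float, headroom_gb: float = 1.0):
    """Return the highest-quality quant that fits, or None if nothing fits."""
    budget = vram_gb - headroom_gb
    for name, needed in GGUF_VRAM_GB:
        if needed <= budget:
            return name
    return None

for gpu_vram in (24, 16, 12):
    print(f"{gpu_vram} GB GPU -> {pick_quant(gpu_vram)}")
```

Raising headroom_gb is a simple way to leave room for a ControlNet or a desktop session that is also using the GPU.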


ControlNet Support

Flux.1 Dev has three ControlNet models from InstantX, each adding approximately 3.6GB VRAM:

ControlNet      Purpose                                    HuggingFace
Canny Edge      Structural guidance from edge detection    InstantX/FLUX.1-dev-Controlnet-Canny
Depth Map       3D spatial control from depth estimation   InstantX/FLUX.1-dev-Controlnet-Depth
Union (Multi)   Combined canny, depth, pose, tile, blur    InstantX/FLUX.1-dev-Controlnet-Union

The Union model is the most versatile — it handles multiple control types in a single model, saving VRAM compared to loading separate ControlNets.

ControlNets are not currently supported for Flux.1 Schnell due to its distilled pipeline.

VRAM budget with ControlNet:

  • FP8 + ControlNet: ~21GB (fits RTX 4090)
  • GGUF Q6 + ControlNet: ~16GB (fits RTX 4080)
  • GGUF Q4 + ControlNet: ~13GB (fits RTX 4070 Ti Super)
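These budgets are the base configuration plus roughly 3.6 GB per loaded ControlNet. A minimal sketch of that arithmetic, using the approximate VRAM figures quoted in this guide (the function names are illustrative):

```python
CONTROLNET_GB = 3.6  # approximate cost per ControlNet, as quoted above

def vram_with_controlnets(base_gb: float, count: int = 1) -> float:
    """Approximate total VRAM for a base config plus `count` ControlNets."""
    return base_gb + count * CONTROLNET_GB

def fits(base_gb: float, gpu_gb: float, count: int = 1) -> bool:
    """True if the estimated total fits in the GPU's VRAM."""
    return vram_with_controlnets(base_gb, count) <= gpu_gb

# Base figures from this guide: FP8 ~17 GB, GGUF Q6 ~12 GB, GGUF Q4 ~9 GB.
for label, base in (("FP8", 17.0), ("GGUF Q6", 12.0), ("GGUF Q4", 9.0)):
    print(f"{label} + 1 ControlNet: ~{vram_with_controlnets(base):.1f} GB")

# Two separate ControlNets on FP8 no longer fit in 24 GB.
print("FP8 + 2 ControlNets fits 24 GB:", fits(17.0, 24.0, count=2))
```

The last line illustrates why the Union model is attractive: two separate ControlNets on FP8 come to about 24.2 GB by this estimate.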

Performance Tips

Use torch.compile for Faster Inference

On NVIDIA GPUs with PyTorch 2.0+, torch.compile can speed up inference by 20-30%:

pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True
)

The first generation will be slower due to compilation, but subsequent generations are faster. Worth it for batch generation workflows.
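To see whether compilation pays off for your batch size, time the first call separately from the rest. The sketch below uses a stand-in workload so it runs without a GPU. With a real pipeline you would pass the generation call instead, where `pipe` and `prompt` are assumed to exist already.

```python
import time

def time_calls(fn, runs: int = 3) -> list[float]:
    """Call fn() `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in for a generation call. With a real pipeline, pass something like:
#   lambda: pipe(prompt, num_inference_steps=28)
_state = {"compiled": False}

def fake_generate():
    if not _state["compiled"]:
        time.sleep(0.05)  # simulated one-time compilation cost
        _state["compiled"] = True
    time.sleep(0.01)      # simulated steady-state generation

first, *rest = time_calls(fake_generate, runs=4)
print(f"first call: {first:.3f}s, later calls avg: {sum(rest) / len(rest):.3f}s")
```

If the extra time on the first call exceeds what the later calls save across your whole batch, compilation is not worth enabling for that run.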

Use Schnell for Iteration, Dev for Finals

A productive workflow: use Flux.1 Schnell at 4 steps to quickly explore compositions and prompts, then switch to Flux.1 Dev at 28 steps for your final image. Schnell is roughly 7x faster, making it ideal for the creative exploration phase.
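The two phases can be captured as a small settings table. The step counts and the Dev guidance scale come from this guide. The Schnell guidance_scale of 0.0 follows the diffusers example for that model and is an assumption here, as are the phase names and the helper.

```python
# Per-phase generation settings for a diffusers FluxPipeline call.
SETTINGS = {
    "explore": {
        "model_id": "black-forest-labs/FLUX.1-schnell",
        "num_inference_steps": 4,
        "guidance_scale": 0.0,  # assumption: value used in the diffusers Schnell example
    },
    "final": {
        "model_id": "black-forest-labs/FLUX.1-dev",
        "num_inference_steps": 28,
        "guidance_scale": 3.5,
    },
}

def settings_for(phase: str) -> dict:
    """Return a copy of the settings for 'explore' or 'final'."""
    return dict(SETTINGS[phase])

step_ratio = (
    SETTINGS["final"]["num_inference_steps"]
    / SETTINGS["explore"]["num_inference_steps"]
)
print(f"Dev runs {step_ratio:.0f}x as many steps as Schnell")
```

The step ratio of 28 to 4 is where the "roughly 7x faster" figure comes from, since both models do about the same work per step.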

Resolution Matters

Flux works best at 1024x1024. You can generate at lower resolutions (768x768, 512x512) for faster iteration, but the model was trained primarily on 1024px images. Going above 1024x1024 can cause artifacts without specific high-resolution techniques.
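For non-square images, a small helper can keep requests inside that range. This is a sketch under two assumptions: that dimensions should be multiples of 16 (which is what the diffusers Flux pipeline expects), and that capping the longer side at 1024 is a reasonable default. The function name is illustrative.

```python
def snap_resolution(width: int, height: int,
                    multiple: int = 16, max_side: int = 1024) -> tuple[int, int]:
    """Scale down so the longer side is at most max_side, then round each
    dimension to the nearest multiple of `multiple`."""
    longest = max(width, height)

    def snap(value: int) -> int:
        scaled = value * max_side / longest if longest > max_side else value
        return max(multiple, round(scaled / multiple) * multiple)

    return snap(width), snap(height)

print(snap_resolution(1024, 1024))
print(snap_resolution(1920, 1080))
print(snap_resolution(770, 500))
```

The aspect ratio is preserved approximately, since each side is rounded independently.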

FP8 Is Usually Enough

The quality difference between FP16 and FP8 is minimal for most use cases. Unless you are doing professional work where subtle detail differences matter, FP8 saves you 16GB of VRAM with negligible quality loss. Start with FP8 and only move to FP16 if you notice issues.


Troubleshooting

Out of memory errors: Try GGUF quantization first (Q4-Q6). If still tight, enable sequential offloading. Reduce resolution to 768x768 as a last resort.

Slow generation: Check that your model is on GPU, not CPU. Verify CUDA is available (torch.cuda.is_available()). Close other GPU-consuming applications. Consider using Schnell instead of Dev.
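A quick way to run those checks in one go is a short report script. This sketch assumes PyTorch may or may not be installed and falls back gracefully. The function name and the report keys are illustrative.

```python
def gpu_report() -> dict:
    """Summarize whether PyTorch can see a CUDA GPU and how much VRAM is free."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        report["device"] = torch.cuda.get_device_name(0)
        report["free_gb"] = round(free_bytes / 1e9, 1)
        report["total_gb"] = round(total_bytes / 1e9, 1)
    return report

print(gpu_report())
```

If free_gb is well below total_gb before you load anything, another application is holding VRAM.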

Black or corrupted images: Usually a precision mismatch. Ensure your VAE and text encoders are loaded at compatible precision. FP16 VAE with FP8 transformer works; mixing FP32 and FP16 components can cause issues.

Text encoder loading fails: T5-XXL is large (9.5GB at FP16). Use the FP8 T5 encoder (t5xxl_fp8_e4m3fn.safetensors) to save memory. It has negligible quality impact.


Summary

Flux.1 is accessible on a wider range of hardware than its 33GB headline number suggests. FP8 precision brings it within reach of 24GB cards, and GGUF quantization or sequential offloading lets you run it on GPUs with as little as 12GB of VRAM.

  • 24GB+ VRAM: FP8 natively, ControlNets available, best experience
  • 16GB VRAM: GGUF Q6-Q8, very good quality
  • 12GB VRAM: GGUF Q4-Q5 or sequential offloading, acceptable quality
  • 8GB VRAM: Consider SDXL instead — better experience on limited hardware


Frequently Asked Questions

How much VRAM do I need to run Flux locally?

Flux.1 Dev at FP16 needs 33GB VRAM. At FP8, it needs about 17GB. GGUF quantized versions (Q4) bring it down to roughly 9GB. With sequential CPU offloading, you can run the FP16 model on as little as 12GB VRAM, but generation will be slower.

Can I run Flux on an RTX 4090?

Yes. The RTX 4090 (24GB) runs Flux.1 Dev at FP8 precision with room to spare — about 17GB VRAM usage. You can also run GGUF Q4-Q6 versions comfortably. For FP16, you would need sequential offloading, which fits but is slower.

What is the difference between Flux Dev and Flux Schnell?

Flux.1 Dev generates images in 28 steps with the highest quality. Flux.1 Schnell is a distilled version that generates in just 4 steps — roughly 7x faster — with slightly lower quality. Schnell also has an Apache 2.0 license (commercial use allowed), while Dev is non-commercial only.

Can I use ControlNet with Flux?

Yes. Flux.1 Dev has ControlNets for canny edge, depth map, and a union multi-control model from InstantX. Each adds about 3.6GB to VRAM requirements. ControlNets are not currently available for Flux.1 Schnell.

How do I reduce Flux VRAM usage?

Three main approaches: (1) Use FP8 precision instead of FP16, cutting VRAM roughly in half. (2) Use GGUF quantized versions from city96 on HuggingFace. (3) Enable sequential CPU offloading in diffusers, which loads model components one at a time.