How to Run Flux Locally — Complete Hardware & Setup Guide
Step-by-step guide to running Flux.1 Dev and Schnell on local hardware. Hardware requirements, ComfyUI setup, diffusers code, GGUF quantization, ControlNet support, and performance optimization tips.
Flux.1 is the current quality leader for local image generation. Getting it running on your hardware requires understanding the available precision options, runtime choices, and optimization techniques. This guide walks through everything from hardware requirements to optimized workflows.
What Is Flux.1?
Flux.1 is a text-to-image model from Black Forest Labs, built by the original creators of Stable Diffusion. It uses a 12B parameter DiT (Diffusion Transformer) architecture with a T5-XXL text encoder (4.7B parameters) and CLIP-L encoder.
There are two main variants:
| Variant | Steps | Speed | License | Quality |
|---|---|---|---|---|
| Flux.1 Dev | 28 | Baseline | Non-commercial | Best |
| Flux.1 Schnell | 4 | ~7x faster | Apache 2.0 | Very good |
Both share the same architecture and VRAM footprint. They differ in step count (and therefore generation speed), license, and, as the table shows, a modest quality edge for Dev.
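As a rough illustration of how the two variants differ at call time, the sketch below keeps per-variant sampling settings in one place. The step counts come from the table above and the Dev guidance value from the diffusers examples later in this guide. The Schnell values `guidance_scale=0.0` and `max_sequence_length=256` follow the usage shown on the Schnell model card and are assumptions here, as is the helper name `sampling_kwargs`.

```python
# Per-variant sampling presets. Step counts are taken from the variants table;
# the Schnell guidance and sequence-length values are assumptions based on
# the usage shown on its model card.
FLUX_PRESETS = {
    "dev": {
        "repo_id": "black-forest-labs/FLUX.1-dev",
        "num_inference_steps": 28,
        "guidance_scale": 3.5,
    },
    "schnell": {
        "repo_id": "black-forest-labs/FLUX.1-schnell",
        "num_inference_steps": 4,
        "guidance_scale": 0.0,
        "max_sequence_length": 256,
    },
}


def sampling_kwargs(variant: str) -> dict:
    """Return keyword arguments for a pipeline call (hypothetical helper)."""
    preset = FLUX_PRESETS[variant.lower()]
    return {key: value for key, value in preset.items() if key != "repo_id"}


print(sampling_kwargs("schnell"))
```

With a loaded diffusers pipeline, the result would be passed as `pipe(prompt=..., **sampling_kwargs("schnell"))`.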
Hardware Requirements
Flux's VRAM needs vary significantly depending on precision and optimization:
| Configuration | VRAM Required | Speed (RTX 4090) | Quality |
|---|---|---|---|
| FP16 (unquantized) | 33 GB | 12 sec/image | Best |
| FP8 | 17 GB | 10 sec/image | Near-best |
| GGUF Q8_0 | ~15 GB | ~14 sec/image | Very good |
| GGUF Q6_K | ~12 GB | ~15 sec/image | Good |
| GGUF Q4_K_S | ~9 GB | ~18 sec/image | Acceptable |
| FP16 + sequential offload | 12.5 GB | 30+ sec/image | Best |
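The FP16 and FP8 figures can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below uses the parameter counts quoted earlier in this guide; ignoring the small CLIP-L and VAE weights is an assumption of the sketch, and the function names are illustrative.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Parameter counts come from the model description in this guide; the CLIP-L
# and VAE weights are small and ignored here (an assumption of this sketch).
TRANSFORMER_PARAMS_B = 12.0  # Flux DiT, billions of parameters
T5_XXL_PARAMS_B = 4.7        # T5-XXL text encoder, billions of parameters

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}


def weights_gb(params_billions: float, precision: str) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def flux_weights_gb(precision: str) -> float:
    """Transformer plus T5-XXL, both stored at the same precision."""
    return (weights_gb(TRANSFORMER_PARAMS_B, precision)
            + weights_gb(T5_XXL_PARAMS_B, precision))


for precision in ("fp16", "fp8"):
    print(f"{precision}: ~{flux_weights_gb(precision):.1f} GB of weights")
```

The estimate lands near 33 GB for FP16 and 17 GB for FP8, in line with the first two rows of the table. Activations and CUDA overhead come on top of the weights.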
Recommended GPUs:
- RTX 4090 / RTX 5090 (24-32GB): Run FP8 natively with room for ControlNets. The ideal Flux experience.
- RTX 4070 Ti Super / RTX 4080 (16GB): GGUF Q6-Q8 for good quality. FP8 is too tight with overhead.
- RTX 4070 (12GB): GGUF Q4-Q5 or sequential offloading. Workable but slower.
- RTX 4060 (8GB): GGUF Q4 only, very tight. Consider SDXL instead.
- Apple Silicon (32GB+ unified): FP16 with MPS backend. Works well on M4 Max and M3 Ultra.
Method 1: Running Flux with ComfyUI
ComfyUI is the recommended runtime for Flux. It offers node-based workflows with fine-grained control over every aspect of generation.
Installation
# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
# Install dependencies
pip install -r requirements.txt
# Start ComfyUI
python main.py
Downloading the Model
Place model files in the ComfyUI directory structure:
ComfyUI/
  models/
    diffusion_models/   # Flux transformer (GGUF or safetensors)
    clip/               # T5-XXL and CLIP-L text encoders
    vae/                # Flux VAE
For GGUF quantized versions (recommended for most users):
# Download GGUF Q4 (~6.8GB file) from city96
huggingface-cli download city96/FLUX.1-dev-gguf flux1-dev-Q4_K_S.gguf \
  --local-dir ComfyUI/models/diffusion_models/
# Download text encoders
huggingface-cli download comfyanonymous/flux_text_encoders \
  t5xxl_fp8_e4m3fn.safetensors clip_l.safetensors \
  --local-dir ComfyUI/models/clip/
# Download VAE (gated repo: accept the license on HuggingFace and run
# `huggingface-cli login` first)
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors \
  --local-dir ComfyUI/models/vae/
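The same downloads can be scripted from Python. The sketch below mirrors the commands above using `hf_hub_download` from the `huggingface_hub` package; the `FLUX_FILES` list and the helper names are illustrative, and it assumes you have already logged in so the gated FLUX.1-dev repository is accessible.

```python
from pathlib import Path

# (repo_id, filename, ComfyUI subfolder) for each file named in the commands above.
FLUX_FILES = [
    ("city96/FLUX.1-dev-gguf", "flux1-dev-Q4_K_S.gguf", "diffusion_models"),
    ("comfyanonymous/flux_text_encoders", "t5xxl_fp8_e4m3fn.safetensors", "clip"),
    ("comfyanonymous/flux_text_encoders", "clip_l.safetensors", "clip"),
    ("black-forest-labs/FLUX.1-dev", "ae.safetensors", "vae"),
]


def target_path(comfy_root: str, subfolder: str, filename: str) -> Path:
    """Where a file should end up inside the ComfyUI models tree."""
    return Path(comfy_root) / "models" / subfolder / filename


def download_all(comfy_root: str = "ComfyUI") -> list[Path]:
    """Download any missing file into the matching ComfyUI models folder."""
    # Imported lazily so the planning helpers above work without the package.
    from huggingface_hub import hf_hub_download

    paths = []
    for repo_id, filename, subfolder in FLUX_FILES:
        dest = target_path(comfy_root, subfolder, filename)
        if not dest.exists():
            hf_hub_download(repo_id=repo_id, filename=filename,
                            local_dir=str(dest.parent))
        paths.append(dest)
    return paths


# Print the plan only; call download_all() to actually fetch the files.
for repo_id, filename, subfolder in FLUX_FILES:
    print(f"{repo_id}/{filename} -> {target_path('ComfyUI', subfolder, filename)}")
```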
Workflow Setup
ComfyUI ships with built-in Flux workflows. Load the default Flux workflow from the workflow gallery, then configure:
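- Set the model loader to your file: the Load Diffusion Model (UNETLoader) node for safetensors, or the Unet Loader (GGUF) node from the ComfyUI-GGUF custom node pack for .gguf files
- Set the DualCLIPLoader node to your T5-XXL and CLIP-L files, with its type set to flux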
- Set the VAELoader to the Flux VAE
- Adjust resolution (1024x1024 recommended) and steps (28 for Dev, 4 for Schnell)
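Once the workflow runs in the browser, it can also be queued from a script. The sketch below assumes ComfyUI is listening on its default local address and that the workflow was exported in API format from the ComfyUI menu; it posts the JSON to the server's `/prompt` endpoint. The helper names are illustrative, and the address should be adjusted if you changed the port.

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # assumed default local ComfyUI address


def build_queue_request(workflow: dict,
                        base_url: str = COMFY_URL) -> urllib.request.Request:
    """Build (but do not send) a POST to ComfyUI's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow_file(path: str) -> dict:
    """Queue a workflow that was exported from ComfyUI in API format."""
    with open(path, "r", encoding="utf-8") as handle:
        workflow = json.load(handle)
    with urllib.request.urlopen(build_queue_request(workflow)) as response:
        return json.loads(response.read())
```

Calling `queue_workflow_file("flux_workflow_api.json")` (a hypothetical file name) returns the server's JSON reply for the queued job.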
Method 2: Running Flux with Diffusers (Python)
For programmatic use, HuggingFace diffusers provides a clean Python API.
FP16 (Unquantized, 33GB VRAM)
import torch
from diffusers import FluxPipeline

# Load every component in 16-bit. torch.bfloat16 is a drop-in alternative
# with the same memory footprint.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
    width=1024,
    height=1024,
).images[0]
image.save("output.png")
FP8 (Half the VRAM — 17GB)
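import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
# Store the transformer weights in FP8 and upcast each layer to FP16 for compute.
# A plain .to(torch.float8_e4m3fn) fails at inference because standard matmul
# kernels do not accept FP8 inputs. This needs a diffusers release that
# provides enable_layerwise_casting.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")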
Sequential Offloading (12GB VRAM)
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
# Do not call pipe.to("cuda") here: offloading manages device placement itself.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A photorealistic mountain landscape at golden hour",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("output.png")
Sequential offloading moves model components between CPU RAM and GPU VRAM during inference. It fits on 12GB but is significantly slower — expect 30+ seconds per image on an RTX 4090 versus 10-12 seconds without offloading.
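Diffusers also offers `enable_model_cpu_offload()`, which moves whole components rather than individual layers and usually costs less speed. The sketch below picks between the three placements based on free VRAM. The thresholds are assumptions: 33 GB comes from the hardware table, and 25 GB is a guess at the room needed for the roughly 24 GB transformer on its own. The function names are illustrative.

```python
def choose_memory_strategy(free_vram_gb: float) -> str:
    """Pick a placement strategy from free VRAM (thresholds are assumptions)."""
    if free_vram_gb >= 33:  # whole FP16 pipeline fits, per the hardware table
        return "cuda"
    if free_vram_gb >= 25:  # assumed: room for the ~24 GB transformer alone
        return "model_offload"
    return "sequential_offload"


def apply_memory_strategy(pipe, free_vram_gb: float) -> str:
    """Apply the chosen strategy to a diffusers pipeline and return its name."""
    strategy = choose_memory_strategy(free_vram_gb)
    if strategy == "cuda":
        pipe.to("cuda")
    elif strategy == "model_offload":
        pipe.enable_model_cpu_offload()       # moves whole components on demand
    else:
        pipe.enable_sequential_cpu_offload()  # moves individual layers on demand
    return strategy


for vram in (48, 28, 12):
    print(f"{vram} GB free -> {choose_memory_strategy(vram)}")
```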
GGUF Quantized Versions
The city96 GGUF quantizations are the most practical way to run Flux on consumer GPUs. Available on HuggingFace:
| Quant | File Size | VRAM | Quality |
|---|---|---|---|
| Q2_K | 4.0 GB | ~6 GB | Significant loss |
| Q3_K_S | 5.2 GB | ~7 GB | Noticeable loss |
| Q4_0 | 6.8 GB | ~9 GB | Acceptable |
| Q4_K_S | 6.8 GB | ~9 GB | Acceptable |
| Q5_K_S | 8.3 GB | ~10 GB | Good |
| Q6_K | 9.9 GB | ~12 GB | Very good |
| Q8_0 | 12.7 GB | ~15 GB | Excellent |
Sources:
- Dev: city96/FLUX.1-dev-gguf
- Schnell: city96/FLUX.1-schnell-gguf
GGUF quantization works through ComfyUI with the GGUF loader nodes, which come from city96's ComfyUI-GGUF custom node pack and are not part of a default install. Quality at Q4-Q5 is close to FP8 for most prompts. Below Q4, text rendering and fine details degrade noticeably.
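Choosing a quant comes down to finding the highest-quality entry that fits your card. The sketch below encodes the approximate VRAM column from the table above; the 1 GB headroom for the desktop and other applications is an assumption, and `pick_quant` is an illustrative name.

```python
# (quant name, approximate VRAM in GB) from the table above, best quality first.
GGUF_QUANTS = [
    ("Q8_0", 15.0),
    ("Q6_K", 12.0),
    ("Q5_K_S", 10.0),
    ("Q4_K_S", 9.0),
    ("Q4_0", 9.0),
    ("Q3_K_S", 7.0),
    ("Q2_K", 6.0),
]


def pick_quant(vram_gb: float, headroom_gb: float = 1.0):
    """Return the highest-quality quant that fits, or None if nothing does.

    headroom_gb is an assumed safety margin for the desktop and other apps.
    """
    budget = vram_gb - headroom_gb
    for name, needed_gb in GGUF_QUANTS:
        if needed_gb <= budget:
            return name
    return None


for card_gb in (24, 16, 12):
    print(f"{card_gb} GB card -> {pick_quant(card_gb)}")
```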
ControlNet Support
Flux.1 Dev has three ControlNet models from InstantX, each adding approximately 3.6GB VRAM:
| ControlNet | Purpose | HuggingFace |
|---|---|---|
| Canny Edge | Structural guidance from edge detection | InstantX/FLUX.1-dev-Controlnet-Canny |
| Depth Map | 3D spatial control from depth estimation | InstantX/FLUX.1-dev-Controlnet-Depth |
| Union (Multi) | Combined canny, depth, pose, tile, blur | InstantX/FLUX.1-dev-Controlnet-Union |
The Union model is the most versatile — it handles multiple control types in a single model, saving VRAM compared to loading separate ControlNets.
ControlNets are not currently supported for Flux.1 Schnell due to its distilled pipeline.
VRAM budget with ControlNet:
- FP8 + ControlNet: ~21GB (fits RTX 4090)
- GGUF Q6 + ControlNet: ~16GB (at the limit of a 16GB RTX 4080)
- GGUF Q4 + ControlNet: ~13GB (fits RTX 4070 Ti Super)
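These budgets are the base configuration plus about 3.6GB per ControlNet. A small sketch of that arithmetic, using the figures quoted in this guide (the dictionary and function names are illustrative):

```python
CONTROLNET_GB = 3.6  # approximate VRAM per ControlNet, as quoted in this section

# Approximate base VRAM per configuration, from the tables in this guide.
BASE_VRAM_GB = {"FP8": 17.0, "GGUF Q6": 12.0, "GGUF Q4": 9.0}


def budget_with_controlnets(base: str, count: int = 1) -> float:
    """Base configuration plus `count` ControlNets, in GB."""
    return BASE_VRAM_GB[base] + CONTROLNET_GB * count


def fits(base: str, card_gb: float, count: int = 1) -> bool:
    """True if the estimated budget is within the card's VRAM."""
    return budget_with_controlnets(base, count) <= card_gb


for base in BASE_VRAM_GB:
    print(f"{base} + 1 ControlNet: ~{budget_with_controlnets(base):.1f} GB")
```

By the same estimate, FP8 with two separate ControlNets would exceed 24GB, which is one reason to prefer the single Union model.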
Performance Tips
Use torch.compile for Faster Inference
On NVIDIA GPUs with PyTorch 2.0+, torch.compile can speed up inference by 20-30%:
pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True,
)
The first generation will be slower due to compilation, but subsequent generations are faster. Worth it for batch generation workflows.
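To see whether compilation pays off on your own machine, time the first call separately from the ones that follow. The helper below is a generic sketch that works with any callable; `time_calls` and `summarize` are illustrative names, not part of PyTorch or diffusers.

```python
import time


def time_calls(fn, runs: int = 3) -> list[float]:
    """Call fn() `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list[float]) -> dict:
    """Separate the first (compile) run from the steady-state runs."""
    warm = durations[1:] or durations
    return {"first_s": durations[0], "steady_s": sum(warm) / len(warm)}
```

With a loaded pipeline, something like `summarize(time_calls(lambda: pipe(prompt="test", num_inference_steps=28)))` gives the compile-run time and the average of the later runs.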
Use Schnell for Iteration, Dev for Finals
A productive workflow: use Flux.1 Schnell at 4 steps to quickly explore compositions and prompts, then switch to Flux.1 Dev at 28 steps for your final image. Schnell is roughly 7x faster, making it ideal for the creative exploration phase.
Resolution Matters
Flux works best at 1024x1024. You can generate at lower resolutions (768x768, 512x512) for faster iteration, but the model was trained primarily on 1024px images. Going above 1024x1024 can cause artifacts without specific high-resolution techniques.
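For non-square images, one approach is to keep the total pixel count near that of 1024x1024 and derive width and height from the aspect ratio. The sketch below does that and snaps both sides to a multiple of 16. Both choices are assumptions: that pixel count is the quantity to hold steady, and that 16 is the divisibility the diffusers Flux pipeline expects.

```python
import math


def dims_for_aspect(aspect_w: int, aspect_h: int,
                    target_pixels: int = 1024 * 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Width and height near target_pixels for the given aspect ratio."""
    ratio = aspect_w / aspect_h
    height = math.sqrt(target_pixels / ratio)
    width = height * ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for aspect in ((1, 1), (16, 9), (2, 3)):
    print(aspect, dims_for_aspect(*aspect))
```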
FP8 Is Usually Enough
The quality difference between FP16 and FP8 is minimal for most use cases. Unless you are doing professional work where subtle detail differences matter, FP8 saves you 16GB of VRAM with negligible quality loss. Start with FP8 and only move to FP16 if you notice issues.
Troubleshooting
Out of memory errors: Try GGUF quantization first (Q4-Q6). If still tight, enable sequential offloading. Reduce resolution to 768x768 as a last resort.
Slow generation: Check that your model is on GPU, not CPU. Verify CUDA is available (torch.cuda.is_available()). Close other GPU-consuming applications. Consider using Schnell instead of Dev.
Black or corrupted images: Usually a precision mismatch. Ensure your VAE and text encoders are loaded at compatible precision. FP16 VAE with FP8 transformer works; mixing FP32 and FP16 components can cause issues.
Text encoder loading fails: T5-XXL is large (9.5GB at FP16). Use the FP8 T5 encoder (t5xxl_fp8_e4m3fn.safetensors) to save memory. It has negligible quality impact.
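Several of the checks above start with the same questions: is CUDA visible, and how much VRAM is actually free? The sketch below gathers that in one call using `torch.cuda.is_available()` and `torch.cuda.mem_get_info()`. The function name `gpu_report` and the shape of the returned dictionary are illustrative.

```python
def gpu_report() -> dict:
    """Collect the basic facts needed to debug slow or failing generation."""
    try:
        import torch  # imported lazily so the report works without PyTorch
    except ImportError:
        return {"torch_installed": False, "cuda_available": False}

    report = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        free_bytes, total_bytes = torch.cuda.mem_get_info()
        report["device"] = torch.cuda.get_device_name(0)
        report["free_gb"] = round(free_bytes / 1e9, 1)
        report["total_gb"] = round(total_bytes / 1e9, 1)
    return report


print(gpu_report())
```

If `cuda_available` is false, generation is running on the CPU. If `free_gb` is far below `total_gb` before you start, another application is holding VRAM.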
Summary
Flux.1 is accessible on a wider range of hardware than its 33GB headline number suggests. With FP8 precision and GGUF quantization, you can run it on GPUs starting at 12GB VRAM.
- 24GB+ VRAM: FP8 natively, ControlNets available, best experience
- 16GB VRAM: GGUF Q6-Q8, very good quality
- 12GB VRAM: GGUF Q4-Q5 or sequential offloading, acceptable quality
- 8GB VRAM: Consider SDXL instead — better experience on limited hardware