Qwen Image — Running Alibaba's 20B Diffusion Model Locally
Guide to running Qwen Image locally. Hardware requirements for its 20.4B DiT transformer and 8.3B Qwen2.5-VL text encoder, diffusers setup, VRAM optimization, and comparison with Flux.
Qwen Image is Alibaba's entry into the open-weight image generation space, and it is one of the largest diffusion models available for local inference. With a 20.4B parameter DiT transformer paired with an 8.3B Qwen2.5-VL text encoder, it demands serious hardware but delivers results that compete with the best commercial models.
What Is Qwen Image?
Qwen Image is a text-to-image diffusion model from Alibaba's Qwen team. It combines two large components:
- 20.4B DiT Transformer — the core diffusion backbone, one of the largest open-weight image transformers available
- 8.3B Qwen2.5-VL Text Encoder — a vision-language model repurposed as a text encoder, providing deep semantic understanding
This architecture gives Qwen Image several distinctive strengths:
- Bilingual prompt support — native understanding of both English and Chinese prompts, not just translation-based
- Complex scene composition — the large parameter count enables better handling of multi-subject, multi-attribute scenes
- Strong text rendering — benefits from the VL encoder's language understanding
- High-resolution output — supports generation up to 1024x1024 and beyond
The model is released under the Apache 2.0 license, with the Qwen community providing the primary support and tooling.
Hardware Requirements
The combined 28.7B parameters across transformer and text encoder make Qwen Image one of the most VRAM-hungry image models:
| Configuration | VRAM Required | Notes |
|---|---|---|
| FP16 (unquantized) | ~42 GB | Needs an 80GB A100/H100 or a dual-GPU setup |
| FP8 | ~22 GB | Fits on RTX 4090 with tight margins |
| FP8 + sequential offload | ~14 GB | Usable on 16GB GPUs, much slower |
| FP16 + sequential offload | ~16 GB | Components swap between CPU and GPU |
Recommended hardware by tier:
- A100 80GB / H100 80GB: Full FP16, fastest generation, no compromises. The intended experience.
- RTX 4090 / RTX 5090 (24-32GB): FP8 fits with care. Expect 15-20 seconds per image. Workable for personal use.
- RTX 4080 / RTX 4070 Ti Super (16GB): Sequential offloading required. Generations take 45+ seconds. Feasible for experimentation.
- Apple Silicon M2 Ultra (192GB unified): FP16 fits in unified memory. The MPS backend works but is slower than CUDA.
- Under 16GB VRAM: Not practical. Consider Flux 2 Dev or Flux 2 Klein 4B instead.
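The tiers above can be sketched as a small lookup. This is a minimal illustration, not part of any library: the function name `pick_qwen_image_config` and the returned labels are hypothetical, and the thresholds are simply the approximate VRAM figures from the table in this section.

```python
def pick_qwen_image_config(vram_gb: float) -> str:
    """Map available VRAM (decimal GB) to a configuration from this guide.

    Thresholds are the approximate figures from the hardware table,
    so treat the result as a starting point rather than a guarantee.
    """
    if vram_gb >= 42:
        return "fp16"                # everything resident, unquantized
    if vram_gb >= 22:
        return "fp8"                 # tight fit on 24GB cards
    if vram_gb >= 16:
        return "sequential-offload"  # slow, but runs on 16GB cards
    return "not-practical"           # consider a smaller model instead


for card, vram in [("H100", 80), ("RTX 4090", 24), ("RTX 4080", 16), ("RTX 4070", 12)]:
    print(f"{card}: {pick_qwen_image_config(vram)}")
```

On a CUDA machine, the input could come from `torch.cuda.get_device_properties(0).total_memory / 1e9`, which reports total device memory in bytes.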
Qwen Image Edit — The Companion Model
Alongside the text-to-image model, Alibaba released Qwen Image Edit, a model designed for instruction-based image editing. Rather than generating from a text prompt alone, it takes an input image plus a text instruction and produces a modified version.
Use cases include:
- Changing object colors, styles, or positions
- Adding or removing elements from a scene
- Style transfer guided by natural language
- Background replacement with text instructions
Qwen Image Edit shares the same architecture and similar VRAM requirements. If you can run Qwen Image, you can run the Edit variant with the same hardware setup.
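A minimal sketch of an instruction-based edit follows. It assumes a diffusers release that ships `QwenImageEditPipeline` and that the weights are published as `Qwen/Qwen-Image-Edit`; the helper names `build_edit_inputs` and `edit_image`, the file paths, and the step count are illustrative choices rather than anything prescribed by the model.

```python
def build_edit_inputs(image, instruction, steps=50, cfg_scale=4.0):
    """Assemble keyword arguments for one instruction-based edit."""
    return {
        "image": image,                # the picture to modify
        "prompt": instruction,         # what to change, in plain language
        "negative_prompt": " ",        # needed for true_cfg_scale to apply
        "true_cfg_scale": cfg_scale,
        "num_inference_steps": steps,
    }


def edit_image(input_path, instruction, output_path):
    """Load the Edit pipeline, apply one instruction, and save the result."""
    # Heavy imports stay inside the function so the helper above can be
    # imported and inspected on a machine without the GPU stack.
    import torch
    from PIL import Image
    from diffusers import QwenImageEditPipeline

    pipe = QwenImageEditPipeline.from_pretrained(
        "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    source = Image.open(input_path).convert("RGB")
    with torch.inference_mode():
        result = pipe(**build_edit_inputs(source, instruction)).images[0]
    result.save(output_path)
    return result

# Example call (downloads the weights on first use):
# edit_image("garden.png", "Change the cherry blossoms to autumn maple leaves", "garden_autumn.png")
```

The same memory-saving options used for the text-to-image pipeline (FP8 storage, CPU offloading) should apply here as well, since the two models share an architecture.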
Running Qwen Image with Diffusers
The HuggingFace diffusers library provides the most straightforward path to running Qwen Image locally.
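FP16 (Unquantized, ~42GB VRAM)
```python
import torch
from diffusers import QwenImagePipeline

# Load all components in 16-bit precision. The model card loads in bfloat16,
# which uses the same 2 bytes per parameter as FP16.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",  # needed for true_cfg_scale to take effect
    num_inference_steps=30,
    true_cfg_scale=4.0,  # classifier-free guidance strength for this pipeline
    width=1024,
    height=1024,
).images[0]
image.save("qwen_image_output.png")
```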
FP8 for RTX 4090 (~22GB VRAM)
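```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
)
# Store transformer weights in FP8 and upcast layer by layer for compute.
# A plain .to(torch.float8_e4m3fn) cast leaves the linear layers unable to run.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
# Move one component to the GPU at a time: the 16-bit text encoder and the
# FP8 transformer together exceed 24GB.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",
    num_inference_steps=30,
    true_cfg_scale=4.0,
).images[0]
```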
Sequential Offloading for 16GB GPUs
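```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
)
# Do not call pipe.to("cuda") here: sequential offloading keeps the weights in
# CPU RAM and streams submodules to the GPU only while they are executing.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",
    num_inference_steps=30,
    true_cfg_scale=4.0,
).images[0]
```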
With sequential offloading, expect generation times of 45-60 seconds on an RTX 4080. The 8.3B text encoder alone takes significant time to process through limited VRAM.
Qwen Image vs Flux — How Do They Compare
The two models take different approaches to local image generation:
| Aspect | Qwen Image | Flux 2 Dev |
|---|---|---|
| Parameters | 20.4B (+ 8.3B encoder) | 12B (+ 4.7B encoder) |
| VRAM at FP16 | ~42 GB | ~24 GB |
| VRAM at FP8 | ~22 GB | ~12 GB |
| Language support | English + Chinese | English |
| Text rendering | Strong | Strong |
| Ecosystem (LoRAs) | Growing | Extensive |
| License | Apache 2.0 | Non-commercial |
Choose Qwen Image when:
- You need bilingual Chinese and English prompt support
- You have access to high-VRAM hardware (A100, H100, or multi-GPU)
- Complex multi-subject scenes are your primary use case
Choose Flux 2 Dev when:
- You want the broadest ecosystem of LoRAs and ControlNets
- Your GPU has 12-24GB VRAM
- You need faster generation times
- Community tooling and workflow support matter
For most users with consumer GPUs, Flux remains the more practical choice. Qwen Image is the model to reach for when you have the hardware to support it and need its specific strengths.
Optimization Tips
System RAM matters. Sequential offloading stores model weights in CPU RAM. With a 28.7B parameter model, you need at least 64GB system RAM for comfortable offloading. 32GB is possible but will involve disk swapping.
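The 64GB recommendation follows from simple arithmetic on the parameter counts quoted in this guide. The sketch below is a weights-only estimate: it ignores activations, the VAE, and framework overhead, and it does not try to reproduce the peak-VRAM figures in the hardware table. The names `BYTES_PER_PARAM` and `weights_gb` are illustrative.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}

# Parameter counts (in billions) as quoted in this guide.
TRANSFORMER_B = 20.4
TEXT_ENCODER_B = 8.3


def weights_gb(params_billions: float, dtype: str) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    return round(params_billions * BYTES_PER_PARAM[dtype], 1)


for dtype in ("fp16", "fp8"):
    total = weights_gb(TRANSFORMER_B + TEXT_ENCODER_B, dtype)
    print(
        f"{dtype}: transformer {weights_gb(TRANSFORMER_B, dtype)} GB, "
        f"text encoder {weights_gb(TEXT_ENCODER_B, dtype)} GB, total {total} GB"
    )
```

At 16-bit precision the weights alone come to roughly 57 GB, which is why 64GB of system RAM is a comfortable floor for offloading and 32GB forces swapping to disk.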
FP8 is essential on consumer hardware. The jump from 42GB to 22GB makes the difference between "impossible" and "workable" on an RTX 4090. Quality loss at FP8 is minimal.
Batch size of 1. Unlike smaller models, Qwen Image leaves little headroom for batch generation on consumer GPUs. Generate one image at a time and use prompt iteration to explore variations.
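One way to iterate on prompts without batching is to enumerate variations up front and feed them to the pipeline one at a time. The helper below is a hypothetical sketch (the name `prompt_variants` and the option keywords are made up for illustration); it only builds strings and does not touch the model.

```python
from itertools import product


def prompt_variants(base, **options):
    """Yield one prompt per combination of the supplied option lists."""
    keys = list(options)
    for combo in product(*(options[key] for key in keys)):
        # Append the chosen modifiers to the base description.
        yield base + ", " + ", ".join(combo)


variants = list(
    prompt_variants(
        "A traditional Chinese garden with a koi pond",
        season=["cherry blossoms falling", "autumn maple leaves"],
        style=["ink painting", "photorealistic"],
    )
)
for text in variants:
    print(text)
```

Each string can then be passed as `prompt=` in its own pipeline call, so only a single image's activations occupy VRAM at any moment.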
torch.compile helps. As with other large DiT models, compiling the transformer with PyTorch 2.0+ reduces overhead on repeated generations:
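```python
# Compilation happens on the first call, so the first generation is slower;
# the speedup shows up on subsequent generations.
pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True,
)
```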
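To check whether compilation actually pays off on your hardware, time a few generations after a warm-up call, since the first call includes compilation. The helper below is a generic sketch using only the standard library; `time_runs` is a hypothetical name, and the `time.sleep` call merely stands in for a real pipeline call.

```python
import time


def time_runs(fn, warmup=1, runs=3):
    """Call fn `warmup` times untimed, then return wall-clock seconds for `runs` calls."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Stand-in workload; replace the lambda with a real generation call.
timings = time_runs(lambda: time.sleep(0.01), warmup=1, runs=3)
print([f"{t:.3f}s" for t in timings])
```

With a real pipeline, the callable might be `lambda: pipe(prompt="...", num_inference_steps=30)`. Running the measurement once before and once after compiling gives a like-for-like comparison.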
Summary
Qwen Image pushes the boundary of what is available for local image generation. Its 20.4B parameter transformer and 8.3B VL text encoder deliver impressive results, particularly for bilingual prompts and complex scenes, but the hardware requirements are steep.
- 80GB+ VRAM (A100/H100): Full FP16, best experience
- 24GB VRAM (RTX 4090): FP8, workable with some patience
- 16GB VRAM: Sequential offloading only, slow but functional
- Under 16GB: Look at Flux 2 Dev or Klein 4B instead