Will It Run AI
qwen, image-generation, hardware, alibaba

Qwen Image — Running Alibaba's 20B Diffusion Model Locally

Guide to running Qwen Image locally. Hardware requirements for its 20.4B DiT transformer and 8.3B Qwen2.5-VL text encoder, diffusers setup, VRAM optimization, and comparison with Flux.

Qwen Image is Alibaba's entry into the open-weight image generation space, and it is one of the largest diffusion models available for local inference. With a 20.4B parameter DiT transformer paired with an 8.3B Qwen2.5-VL text encoder, it demands serious hardware but delivers results that compete with the best commercial models.


What Is Qwen Image?

Qwen Image is a text-to-image diffusion model from Alibaba's Qwen team. It combines two large components:

  • 20.4B DiT Transformer — the core diffusion backbone, one of the largest open-weight image transformers available
  • 8.3B Qwen2.5-VL Text Encoder — a vision-language model repurposed as a text encoder, providing deep semantic understanding

This architecture gives Qwen Image several distinctive strengths:

  • Bilingual prompt support — native understanding of both English and Chinese prompts, not just translation-based
  • Complex scene composition — the large parameter count enables better handling of multi-subject, multi-attribute scenes
  • Strong text rendering — benefits from the VL encoder's language understanding
  • High-resolution output — supports generation up to 1024x1024 and beyond

The model weights are released under the Apache 2.0 license, and the Qwen community provides most of the support and tooling.


Hardware Requirements

The combined 28.7B parameters across transformer and text encoder make Qwen Image one of the most VRAM-hungry image models:

| Configuration | VRAM Required | Notes |
| --- | --- | --- |
| FP16 (unquantized) | ~42 GB | Needs A100/H100 or dual-GPU setup |
| FP8 | ~22 GB | Fits on RTX 4090 with tight margins |
| FP8 + sequential offload | ~14 GB | Usable on 16GB GPUs, much slower |
| FP16 + sequential offload | ~16 GB | Components swap between CPU and GPU |
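
As a rough cross-check on these figures, weight memory can be estimated as parameter count times bytes per parameter. The sketch below uses only the parameter counts quoted in this guide. It ignores activations, the VAE, and framework overhead, so treat the results as lower bounds, not measurements.

```python
# Weights-only memory estimate: parameters x bytes per parameter.
# Parameter counts come from this guide; activations, the VAE, and
# framework overhead are NOT included, so real usage will be higher.

PARAMS_BILLIONS = {"transformer": 20.4, "text_encoder": 8.3}
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gb(component: str, dtype: str) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS_BILLIONS[component] * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    for component in PARAMS_BILLIONS:
        print(f"{component:>12} @ {dtype}: {weight_gb(component, dtype):.1f} GB")
```

On these numbers the transformer alone is about 40.8 GB at FP16 and 20.4 GB at FP8, which is close to the ~42 GB and ~22 GB rows above. That suggests the table assumes the text encoder is not resident on the GPU at the same time as the transformer. This is an inference from the arithmetic, not something the source states.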

Recommended hardware by tier:

  • A100 80GB / H100 80GB: Full FP16, fastest generation, no compromises. The intended experience.
  • RTX 4090 / RTX 5090 (24-32GB): FP8 fits with care. Expect 15-20 seconds per image. Workable for personal use.
  • RTX 4080 / RTX 4070 Ti Super (16GB): Sequential offloading required. Generations take 45+ seconds. Feasible for experimentation.
  • Apple Silicon M2 Ultra (192GB unified): FP16 fits in unified memory. MPS backend works but is slower than CUDA.
  • Under 16GB VRAM: Not practical. Consider Flux 2 Dev or Flux 2 Klein 4B instead.
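
The tiers above can be turned into a small lookup. This is a minimal sketch: the thresholds simply mirror the tier list in this guide (they are not measured limits), the function names are hypothetical, and Apple Silicon is not covered because the detection helper only looks at CUDA device 0.

```python
# Map available VRAM (in GB) to the configuration tiers listed in this guide.
# Thresholds mirror the guide's tiers; they are assumptions, not measured limits.

def pick_config(vram_gb: float) -> str:
    if vram_gb >= 80:
        return "fp16"
    if vram_gb >= 24:
        return "fp8"
    if vram_gb >= 16:
        return "sequential-offload"
    return "not-practical"


def detect_vram_gb() -> float:
    """Total memory of CUDA device 0 in GB (1e9 bytes), or 0.0 if unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1e9


print(pick_config(detect_vram_gb()))
```

Dividing by 1e9 instead of 2**30 is deliberate here: a nominal "24GB" card reports slightly under 24 GiB, which would otherwise fall into the lower tier.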

Qwen Image Edit — The Companion Model

Alongside the text-to-image model, Alibaba released Qwen Image Edit, a model designed for instruction-based image editing. Rather than generating from a text prompt alone, it takes an input image plus a text instruction and produces a modified version.

Use cases include:

  • Changing object colors, styles, or positions
  • Adding or removing elements from a scene
  • Style transfer guided by natural language
  • Background replacement with text instructions

Qwen Image Edit shares the same architecture and similar VRAM requirements. If you can run Qwen Image, you can run the Edit variant with the same hardware setup.
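
For orientation, an editing call might look like the sketch below. It assumes a recent diffusers release that ships QwenImageEditPipeline and the "Qwen/Qwen-Image-Edit" checkpoint. The helper name, the file paths, and the choice of bfloat16 with model offloading are illustrative assumptions, not settings taken from this guide.

```python
# Sketch: instruction-based editing with diffusers' QwenImageEditPipeline.
# Assumptions: a recent diffusers release that includes this pipeline, the
# "Qwen/Qwen-Image-Edit" checkpoint, and hardware comparable to Qwen Image.

def edit_image(input_path: str, instruction: str, output_path: str) -> None:
    # Heavy imports live inside the function so that defining it stays cheap.
    import torch
    from diffusers import QwenImageEditPipeline
    from PIL import Image

    pipe = QwenImageEditPipeline.from_pretrained(
        "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
    )
    # Assumption: model offload, so only the active component sits on the GPU.
    pipe.enable_model_cpu_offload()

    source = Image.open(input_path).convert("RGB")
    result = pipe(
        image=source,
        prompt=instruction,
        negative_prompt=" ",
        true_cfg_scale=4.0,
        num_inference_steps=50,
    ).images[0]
    result.save(output_path)
```

A call such as edit_image("garden.png", "Replace the koi with goldfish", "garden_edited.png") would then write the edited image to disk; both file names are placeholders.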


Running Qwen Image with Diffusers

The HuggingFace diffusers library provides the most straightforward path to running Qwen Image locally.

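FP16 (Unquantized, ~42GB VRAM)

import torch
from diffusers import QwenImagePipeline

# Load every component in 16-bit. bfloat16 has the same 2-byte footprint as
# FP16 and is the dtype used in the model card's example.
pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",  # required for true_cfg_scale to take effect
    num_inference_steps=30,
    true_cfg_scale=4.0,  # guidance_scale only applies to guidance-distilled checkpoints
    width=1024,
    height=1024,
).images[0]
image.save("qwen_image_output.png")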

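FP8 for RTX 4090 (~22GB VRAM)

import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
)
# Store transformer weights in FP8 and upcast each layer to bfloat16 for compute.
# A plain .to(torch.float8_e4m3fn) fails at inference because standard PyTorch
# layers cannot run their matmuls directly on FP8 weights.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
# Assumption: model offload keeps the 16-bit text encoder and the FP8
# transformer from occupying the GPU at the same time.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",  # required for true_cfg_scale to take effect
    num_inference_steps=30,
    true_cfg_scale=4.0,
).images[0]
image.save("qwen_image_fp8.png")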

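Sequential Offloading for 16GB GPUs

import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
)
# Weights stay in system RAM and are moved to the GPU one submodule at a time.
# Do not call pipe.to("cuda") when this is enabled.
pipe.enable_sequential_cpu_offload()

image = pipe(
    prompt="A traditional Chinese garden with a koi pond, cherry blossoms falling",
    negative_prompt=" ",  # required for true_cfg_scale to take effect
    num_inference_steps=30,
    true_cfg_scale=4.0,
).images[0]
image.save("qwen_image_offload.png")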

With sequential offloading, expect generation times of 45-60 seconds on an RTX 4080. Much of that time goes to moving weights between system RAM and the GPU, and the 8.3B text encoder alone adds a noticeable delay before denoising starts.
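
Timings like these vary with GPU, driver, and system RAM speed, so it is worth measuring on your own machine. The helper below is a minimal sketch (the name measure is hypothetical) that wraps any callable, including a loaded pipeline, and reports wall-clock time plus peak allocated VRAM when CUDA is available.

```python
import time


def measure(fn, *args, **kwargs):
    """Call fn and return (result, seconds, peak_vram_gb).

    peak_vram_gb is None when PyTorch or CUDA is unavailable.
    """
    try:
        import torch
        cuda = torch.cuda.is_available()
    except ImportError:
        torch, cuda = None, False

    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()  # make sure earlier GPU work is not counted
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    seconds = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if cuda else None
    return result, seconds, peak_gb
```

With a pipeline already loaded as pipe, a call such as measure(pipe, prompt="a koi pond", negative_prompt=" ", true_cfg_scale=4.0, num_inference_steps=30) returns the pipeline output together with the two measurements.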


Qwen Image vs Flux — How Do They Compare

The two models take different approaches to local image generation:

| Aspect | Qwen Image | Flux 2 Dev |
| --- | --- | --- |
| Parameters | 20.4B (+ 8.3B encoder) | 12B (+ 4.7B encoder) |
| VRAM at FP16 | ~42 GB | ~24 GB |
| VRAM at FP8 | ~22 GB | ~12 GB |
| Language support | English + Chinese | English |
| Text rendering | Strong | Strong |
| Ecosystem (LoRAs) | Growing | Extensive |
| License | Apache 2.0 | Non-commercial |

Choose Qwen Image when:

  • You need bilingual Chinese and English prompt support
  • You have access to high-VRAM hardware (A100, H100, or multi-GPU)
  • Complex multi-subject scenes are your primary use case

Choose Flux 2 Dev when:

  • You want the broadest ecosystem of LoRAs and ControlNets
  • Your GPU has 12-24GB VRAM
  • You need faster generation times
  • Community tooling and workflow support matter

For most users with consumer GPUs, Flux remains the more practical choice. Qwen Image is the model to reach for when you have the hardware to support it and need its specific strengths.


Optimization Tips

System RAM matters. Sequential offloading stores model weights in CPU RAM. With a 28.7B parameter model, you need at least 64GB system RAM for comfortable offloading. 32GB is possible but will involve disk swapping.
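
The RAM figures follow from simple arithmetic on the 16-bit weights. The sketch below counts weights only and does not account for the operating system, the Python process, or other applications, so real headroom is smaller than what it prints.

```python
# Worked arithmetic for the RAM guidance above (weights only, 16-bit).
TOTAL_PARAMS_BILLIONS = 20.4 + 8.3   # transformer + text encoder, from this guide
BYTES_PER_PARAM = 2                  # FP16 / BF16


def ram_headroom_gb(system_ram_gb: float) -> float:
    """RAM left after holding all weights; a negative value means spilling to swap."""
    weights_gb = TOTAL_PARAMS_BILLIONS * BYTES_PER_PARAM
    return system_ram_gb - weights_gb


for ram in (32, 64):
    print(f"{ram} GB RAM -> headroom {ram_headroom_gb(ram):+.1f} GB")
```

The weights come to about 57.4 GB, which leaves roughly 6.6 GB spare on a 64GB machine and a shortfall of about 25 GB on a 32GB machine. That shortfall is what ends up in swap.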

FP8 is essential on consumer hardware. The jump from 42GB to 22GB makes the difference between "impossible" and "workable" on an RTX 4090. Quality loss at FP8 is minimal.

Batch size of 1. Unlike smaller models, Qwen Image leaves little headroom for batch generation on consumer GPUs. Generate one image at a time and use prompt iteration to explore variations.
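
Prompt iteration at batch size 1 can be sketched as a plain loop. The helper name generate_variations is hypothetical, and the keyword arguments assume a Qwen Image pipeline from diffusers that accepts true_cfg_scale and negative_prompt.

```python
def generate_variations(pipe, prompts, steps=30, cfg=4.0):
    """Run the pipeline once per prompt (batch size 1) and collect the images."""
    images = []
    for prompt in prompts:
        # One prompt per call keeps peak memory at the single-image level.
        out = pipe(
            prompt=prompt,
            negative_prompt=" ",
            true_cfg_scale=cfg,
            num_inference_steps=steps,
        )
        images.append(out.images[0])
    return images
```

Passing a list such as ["a koi pond at dawn", "a koi pond at dusk"] yields one image per prompt, in order, without ever asking the GPU to hold more than one image's activations.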

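torch.compile helps. As with other large DiT models, compiling the transformer with PyTorch 2.0+ reduces overhead on repeated generations. The first generation after compiling is slower because compilation happens during that call, so the benefit only shows up when you generate several images in one session: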

pipe.transformer = torch.compile(
    pipe.transformer,
    mode="reduce-overhead",
    fullgraph=True
)

Summary

Qwen Image pushes the boundary of what is available for local image generation. Its 20.4B parameter transformer and 8.3B VL text encoder deliver impressive results, particularly for bilingual prompts and complex scenes, but the hardware requirements are steep.

  • 80GB+ VRAM (A100/H100): Full FP16, best experience
  • 24GB VRAM (RTX 4090): FP8, workable with some patience
  • 16GB VRAM: Sequential offloading only, slow but functional
  • Under 16GB: Look at Flux 2 Dev or Klein 4B instead


Frequently Asked Questions

How much VRAM does Qwen Image need?

Qwen Image at FP16 requires approximately 42GB VRAM due to its 20.4B DiT transformer and 8.3B text encoder. With FP8 quantization, it drops to roughly 22GB. Sequential offloading brings it down to roughly 14-16GB at the cost of speed.

Can I run Qwen Image on an RTX 4090?

Not at FP16 — the 24GB VRAM is insufficient for the full 42GB footprint. However, with FP8 quantization (~22GB) and careful memory management, or with sequential offloading, an RTX 4090 can run Qwen Image with some tradeoffs.

What makes Qwen Image different from Flux?

Qwen Image is significantly larger (20.4B vs 12B parameters) and uses a Qwen2.5-VL vision-language model as its text encoder, giving it strong bilingual understanding in both English and Chinese. It excels at complex scene composition and text-heavy prompts.

What is Qwen Image Edit?

Qwen Image Edit is a companion model designed for image editing tasks rather than generation from scratch. It accepts an input image plus text instructions and produces edited results, similar to InstructPix2Pix but built on the Qwen architecture.