CogVideoX 5B
Stable · by THUDM
Open-source video generation model from Tsinghua University. 3D full-attention transformer with expert adaptive LayerNorm. Generates 6-second clips at 8fps.
- Full 3D attention transformer
- 6-second video clips at 8fps
- 5B parameters, runs on 24GB+ VRAM
- Open research model from Tsinghua University
Image Quality Benchmarks
Measured quality metrics for CogVideoX 5B outputs.
- How often humans prefer this model's output (0-100%)
- Visual quality and composition rating (5-9 scale)
VRAM by Scenario
VRAM estimates at FP16 and FP8 precision. In the scenarios below, FP8 saves about 10 GB, which is roughly 28-40% less memory, with minimal quality loss. The grade shows how well each GPU handles the generation workload.
FP16 (full precision)
| Scenario | VRAM | RTX 4090 24GB | RTX 3060 12GB | RTX 4060 8GB | MacBook Pro M4 Pro 24GB |
|---|---|---|---|---|---|
| 512×512 · 25 frames | 25.3 GB | B | F | F | F |
| 768×512 · 25 frames | 27.4 GB | B | F | F | F |
| 768×512 · 100 frames | 33.7 GB | F | F | F | F |
| 1280×720 · 25 frames | 35.9 GB | F | F | F | F |
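As a rough reading aid, the FP16 estimates above can be compared against a card's memory. The sketch below is a minimal illustration only: the helpers `fits_natively` and `shortfall_gb` are hypothetical, and the simple "estimate is no larger than capacity" rule is an assumption, not the grading method used on this page.

```python
# FP16 VRAM estimates in GB, copied from the table above.
FP16_ESTIMATES_GB = {
    "512x512, 25 frames": 25.3,
    "768x512, 25 frames": 27.4,
    "768x512, 100 frames": 33.7,
    "1280x720, 25 frames": 35.9,
}


def fits_natively(required_gb: float, gpu_gb: float) -> bool:
    """Assumed rule: the workload fits if the estimate does not exceed GPU memory."""
    return required_gb <= gpu_gb


def shortfall_gb(required_gb: float, gpu_gb: float) -> float:
    """Memory that would have to be offloaded or saved, in GB (0.0 if it fits)."""
    return round(max(0.0, required_gb - gpu_gb), 1)


if __name__ == "__main__":
    gpu_gb = 24.0  # e.g. an RTX 4090
    for scenario, need in FP16_ESTIMATES_GB.items():
        status = "fits" if fits_natively(need, gpu_gb) else f"short by {shortfall_gb(need, gpu_gb)} GB"
        print(f"{scenario}: {need} GB -> {status}")
```

Under this assumption, every FP16 scenario exceeds a 24 GB card, the smallest by 1.3 GB, which is why offloading or FP8 comes up for the RTX 4090.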
FP8 (quantized, roughly 28-40% less VRAM)
| Scenario | VRAM | RTX 4090 24GB | RTX 3060 12GB | RTX 4060 8GB | MacBook Pro M4 Pro 24GB |
|---|---|---|---|---|---|
| 512×512 · 25 frames | 15.1 GB | S | D | F | A |
| 768×512 · 25 frames | 17.2 GB | S | F | F | B |
| 768×512 · 100 frames | 23.5 GB | B | F | F | D |
| 1280×720 · 25 frames | 25.7 GB | B | F | F | F |
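The two tables can be compared row by row to see how much FP8 saves in each scenario. This is a minimal sketch that uses only the values from the tables above; `savings` is a hypothetical helper introduced for illustration.

```python
# (scenario, FP16 GB, FP8 GB), copied from the two tables above.
ESTIMATES_GB = [
    ("512x512, 25 frames", 25.3, 15.1),
    ("768x512, 25 frames", 27.4, 17.2),
    ("768x512, 100 frames", 33.7, 23.5),
    ("1280x720, 25 frames", 35.9, 25.7),
]


def savings(fp16_gb: float, fp8_gb: float) -> tuple[float, float]:
    """Return (GB saved, percent saved relative to FP16), each rounded to 1 decimal."""
    saved = fp16_gb - fp8_gb
    return round(saved, 1), round(100 * saved / fp16_gb, 1)


if __name__ == "__main__":
    for scenario, fp16_gb, fp8_gb in ESTIMATES_GB:
        saved_gb, saved_pct = savings(fp16_gb, fp8_gb)
        print(f"{scenario}: saves {saved_gb} GB ({saved_pct}%)")
```

In these estimates the saving is a constant 10.2 GB, so the percentage shrinks as the workload grows: about 40% at 512×512 and about 28% at 1280×720.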
Optimization Tips
Turbo / LCM distillation
Use a distilled scheduler at 4-8 steps for faster iteration.
Run with Python
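import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
# Load the 5B checkpoint in half precision and move it to the GPU.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
# 49 frames is roughly 6 seconds at 8 fps.
frames = pipe(
    prompt="your prompt here",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]
# Export the frames as an MP4 at 8 fps.
export_to_video(frames, "output.mp4", fps=8)
Get started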
Setup instructions for running CogVideoX 5B locally
1. Download the model
Get the checkpoint from Hugging Face
2. Place in:
ComfyUI/models/checkpoints/
3. Launch ComfyUI
python main.py
Memory Breakdown
VRAM allocation for 25 frames at 768×512 on RTX 4090 24GB
Estimated Generation Time
25 frames at 768×512, 30 steps, FP16.
Available Formats & Downloads
Download CogVideoX 5B in different precisions. Lower precision uses less VRAM at the cost of a slight quality loss.
| Format | Precision | Size | Provider | |
|---|---|---|---|---|
| safetensors | FP16 | 10.3 GB | official | Download |
LoRA Ecosystem
Limited: few LoRAs are available for CogVideoX.
Frequently asked questions: CogVideoX 5B
How much VRAM does CogVideoX 5B need for video?
CogVideoX 5B requires approximately 27.4 GB of VRAM at FP16 precision to generate 25 frames at 768×512. Video generation typically requires more VRAM than image generation because of the temporal attention layers.
Can I run CogVideoX 5B on RTX 4090?
CogVideoX 5B can run on the RTX 4090 with sequential CPU offloading, because the FP16 estimates exceed the card's 24 GB. Generation will be significantly slower than when the model fits entirely in VRAM.
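Sequential offloading can be enabled through the diffusers pipeline. The following is a minimal sketch rather than a tuned setup: it assumes `torch`, `diffusers`, and `accelerate` are installed, and `load_offloaded_pipeline` is a hypothetical wrapper name. The VAE slicing and tiling calls are optional extras that lower peak memory during decoding.

```python
def load_offloaded_pipeline(model_id: str = "THUDM/CogVideoX-5b"):
    """Build a CogVideoX pipeline that keeps weights in CPU RAM and
    moves each submodule to the GPU only while it is needed."""
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    # Do not call pipe.to("cuda") here: offloading handles device placement.
    pipe.enable_sequential_cpu_offload()
    # Decode latents in slices and tiles to reduce peak VAE memory.
    pipe.vae.enable_slicing()
    pipe.vae.enable_tiling()
    return pipe


if __name__ == "__main__":
    pipe = load_offloaded_pipeline()  # downloads the checkpoint on first use
    frames = pipe(
        prompt="your prompt here",
        num_inference_steps=50,
        guidance_scale=6.0,
        num_frames=49,
    ).frames[0]
```

The trade-off is the one described above: memory use drops well below the full-model footprint, but every step pays for CPU-to-GPU transfers.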
How long does it take to generate a video with CogVideoX 5B?
On the reference GPU (RTX 4090 24GB), CogVideoX 5B generates a 25-frame video at 768×512 in approximately 4m 23s at FP16 with 30 inference steps. Faster GPUs with higher memory bandwidth will reduce generation time.
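The reference timing can be scaled to other step counts as a back-of-the-envelope estimate. The sketch below assumes generation time grows linearly with the number of inference steps and ignores fixed costs such as text encoding and VAE decoding, so treat the result as a rough guide; `estimate_seconds` and `format_duration` are hypothetical helpers.

```python
REFERENCE_SECONDS = 4 * 60 + 23  # 4m 23s on the reference RTX 4090
REFERENCE_STEPS = 30


def estimate_seconds(steps: int,
                     ref_seconds: float = REFERENCE_SECONDS,
                     ref_steps: int = REFERENCE_STEPS) -> float:
    """Scale the reference time linearly with the step count (an assumption)."""
    return ref_seconds * steps / ref_steps


def format_duration(seconds: float) -> str:
    """Format seconds as 'Xm YYs', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


if __name__ == "__main__":
    for steps in (30, 50):
        print(f"{steps} steps: about {format_duration(estimate_seconds(steps))}")
```

At roughly 8.8 seconds per step, the 50 steps used in the Python example would take about 7m 18s for the same 25-frame, 768×512 workload under this assumption.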
What resolution and frame count does CogVideoX 5B support?
CogVideoX 5B supports up to 720×480 resolution and 49 frames per generation at 8 FPS, which is about 6 seconds of video. Higher resolutions and frame counts require more VRAM.
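The relationship between frame count and clip length is simple arithmetic. A minimal sketch follows; `clip_seconds` and `frames_for` are hypothetical helpers, and the "+ 1" starting frame in `frames_for` is an assumption used to explain why 6 seconds at 8 FPS corresponds to 49 frames rather than 48.

```python
def clip_seconds(num_frames: int, fps: int = 8) -> float:
    """Playback length of a clip in seconds."""
    return num_frames / fps


def frames_for(seconds: int, fps: int = 8) -> int:
    """Frame count for a clip of the given length, assuming one extra
    starting frame on top of seconds * fps."""
    return seconds * fps + 1


if __name__ == "__main__":
    for frames in (25, 49, 100):
        print(f"{frames} frames at 8 FPS: {clip_seconds(frames)} s")
    print(f"6 s at 8 FPS: {frames_for(6)} frames")
```

By this arithmetic, the 25-frame scenarios in the VRAM tables correspond to clips of about 3 seconds, and 49 frames to just over 6 seconds.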