CogVideoX 5B

Name: CogVideoX 5B
Author: THUDM

Stable

Open-source video generation model from Tsinghua University. 3D full-attention transformer with expert adaptive LayerNorm. Generates 6-second clips at 8fps.

Full 3D attention transformer
6-second video clips at 8fps
5B parameters — runs on 24GB+ VRAM
Open research model from Tsinghua University

30K downloads

Parameters: 5B

Max Resolution: 720×480

Max Frames: 49

FPS: 8

Architecture: 3D-DiT

License: cogvideox
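
The frame count and frame rate in the spec list determine clip length. A minimal sketch of that arithmetic follows; the function name `clip_seconds` is illustrative and not part of any library.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Return the playback duration of a clip in seconds."""
    return num_frames / fps

# Values from the spec list: 49 frames at 8 FPS.
duration = clip_seconds(49, 8)
print(f"{duration:.3f} s")  # 6.125 s
```

This matches the "6-second clips" description: 49 frames at 8 FPS play for just over six seconds.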

Image Quality Benchmarks

Measured quality metrics for CogVideoX 5B outputs.

Human Preference Score: 72%

How often humans prefer this model's output (0-100%)

Aesthetic Score: 6.8

Visual quality and composition rating (5-9 scale)

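At FP16, this model needs about 27 GB of VRAM for basic video generation (25 frames at 768×512), which exceeds a 24 GB card. A GPU with 24GB+ VRAM is recommended, combined with FP8 quantization or CPU offloading.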

VRAM by Scenario

VRAM estimates at FP16 and FP8 precision. In the scenarios below, FP8 uses about 10 GB less memory than FP16 (roughly 28-40% less) with minimal quality loss. Grade shows how well each GPU handles the generation workload.

FP16 (unquantized)

Scenario	VRAM	RTX 4090 24GB	RTX 3060 12GB	RTX 4060 8GB	MacBook Pro M4 Pro 24GB
512×512 · 25 frames	25.3 GB	B	F	F	F
768×512 · 25 frames	27.4 GB	B	F	F	F
768×512 · 100 frames	33.7 GB	F	F	F	F
1280×720 · 25 frames	35.9 GB	F	F	F	F

FP8 (quantized, about 10 GB less VRAM)

Scenario	VRAM	RTX 4090 24GB	RTX 3060 12GB	RTX 4060 8GB	MacBook Pro M4 Pro 24GB
512×512 · 25 frames	15.1 GB	S	D	F	A
768×512 · 25 frames	17.2 GB	S	F	F	B
768×512 · 100 frames	23.5 GB	B	F	F	D
1280×720 · 25 frames	25.7 GB	B	F	F	F
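
The saving from FP8 can be checked directly against the two tables. The sketch below copies the estimates as listed and computes the absolute and relative difference per scenario, plus a plain "does it fit" comparison against a 24 GB card. The helper names `fp8_saving` and `fits` are illustrative. The letter grades in the tables are not derived this way; the page does not state its grading rule.

```python
# VRAM estimates in GB as (FP16, FP8), copied from the two tables.
ESTIMATES = {
    "512x512, 25 frames": (25.3, 15.1),
    "768x512, 25 frames": (27.4, 17.2),
    "768x512, 100 frames": (33.7, 23.5),
    "1280x720, 25 frames": (35.9, 25.7),
}

def fp8_saving(fp16_gb: float, fp8_gb: float) -> tuple:
    """Return (absolute saving in GB, relative saving in percent)."""
    saved = fp16_gb - fp8_gb
    return round(saved, 1), round(100 * saved / fp16_gb, 1)

def fits(required_gb: float, gpu_gb: float) -> bool:
    """True when the estimate is within the GPU's total VRAM."""
    return required_gb <= gpu_gb

for scenario, (fp16_gb, fp8_gb) in ESTIMATES.items():
    saved_gb, saved_pct = fp8_saving(fp16_gb, fp8_gb)
    print(f"{scenario}: saves {saved_gb} GB ({saved_pct}%), "
          f"fits in 24 GB at FP8: {fits(fp8_gb, 24.0)}")
```

The absolute saving is the same in every row, so the relative saving shrinks as the scenario grows.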

Optimization Tips

Turbo / LCM distillation

Use a distilled scheduler at 4-8 steps for faster iteration.

Run with Python (diffusers)

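from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
import torch

# Load the 5B checkpoint in half precision (FP16).
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.float16
)
pipe.to("cuda")

# Generate 49 frames, the model's maximum per clip.
frames = pipe(
    prompt="your prompt here",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,
).frames[0]

# Export the frames as a video at 8 FPS (the output file name is arbitrary).
export_to_video(frames, "output.mp4", fps=8)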

Get started

Setup instructions for running CogVideoX 5B locally

1. Download the model

Get the checkpoint from HuggingFace

2. Place in:

ComfyUI/models/checkpoints/

3. Launch ComfyUI

python main.py

Note: Video generation requires video output nodes. Install ComfyUI-VideoHelperSuite from the ComfyUI Manager for SaveAnimatedWEBP or VHS_VideoCombine nodes.
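
Before launching, it can help to confirm that the download landed in the folder named in step 2. The sketch below lists `.safetensors` files in that folder. The helper name `find_checkpoints` is hypothetical, and it assumes the ComfyUI install directory is called `ComfyUI` relative to the current working directory.

```python
from pathlib import Path

def find_checkpoints(comfyui_root: str) -> list:
    """List .safetensors files under <comfyui_root>/models/checkpoints/."""
    checkpoint_dir = Path(comfyui_root) / "models" / "checkpoints"
    if not checkpoint_dir.is_dir():
        return []
    return sorted(p.name for p in checkpoint_dir.glob("*.safetensors"))

# Assumed install location; adjust the path to match the local setup.
print(find_checkpoints("ComfyUI"))
```

An empty list means either the folder does not exist or no checkpoint file has been placed in it yet.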

Memory Breakdown

VRAM allocation for 25 frames at 768×512 on RTX 4090 24GB

Required: 27.4 GB
Available: 24.0 GB

Weights: 10.0 GB

VAE: 0.2 GB

Text Encoder: 9.4 GB

Activations: 6.0 GB

Overhead: 0.5 GB
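
The breakdown can be tallied with a few lines of arithmetic. This sketch uses the figures as listed and adds one assumption: that the weights figure corresponds to 5B parameters at 2 bytes each (FP16), measured in decimal gigabytes. The itemized lines sum to 26.1 GB, about 1.3 GB below the 27.4 GB "Required" figure; the page does not say what accounts for the difference.

```python
# Component sizes in GB, copied from the breakdown.
BREAKDOWN_GB = {
    "weights": 10.0,
    "vae": 0.2,
    "text_encoder": 9.4,
    "activations": 6.0,
    "overhead": 0.5,
}
REQUIRED_GB = 27.4   # headline figure from the page
AVAILABLE_GB = 24.0  # RTX 4090

itemized = round(sum(BREAKDOWN_GB.values()), 1)
shortfall = round(REQUIRED_GB - AVAILABLE_GB, 1)

# Assumption: 5B parameters at 2 bytes each (FP16), in decimal gigabytes.
weights_estimate_gb = 5e9 * 2 / 1e9

print(f"itemized total: {itemized} GB")
print(f"shortfall vs. available: {shortfall} GB")
print(f"weights estimate: {weights_estimate_gb} GB")
```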

Estimated Generation Time

25 frames at 768×512, 30 steps, FP16.

RTX 4090 24GB: ~4m 23s

RTX 3060 12GB: ~7m 52s

RTX 4060 8GB: ~11m 53s

MacBook Pro M4 Pro 24GB: ~16m 55s
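
These timings are for 30 steps, while the diffusers example on this page uses 50. The sketch below converts a listed time to seconds and rescales it to another step count. It assumes runtime is linear in the number of steps, which ignores fixed costs such as text encoding and VAE decoding, and it does not adjust for frame count. Both helper names are illustrative.

```python
import re

def parse_duration(text: str) -> int:
    """Convert a string such as '4m 23s' to whole seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds

def scale_to_steps(total_seconds: int, measured_steps: int, target_steps: int) -> float:
    """Rescale a timing, assuming runtime is linear in the step count."""
    return total_seconds / measured_steps * target_steps

rtx4090 = parse_duration("4m 23s")  # listed RTX 4090 time for 30 steps
per_step = rtx4090 / 30
at_50_steps = scale_to_steps(rtx4090, 30, 50)
print(f"{per_step:.2f} s/step, about {at_50_steps:.0f} s at 50 steps")
```

Treat the result as a rough planning number rather than a measurement.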

Available Formats & Downloads

Download CogVideoX 5B in different precisions. Lower precision = less VRAM but slight quality loss.

Format	Precision	Size	Provider	Link
safetensors	FP16	10.3 GB	official	Download

LoRA Ecosystem

Limited

Few LoRAs available for CogVideoX.

Related Workflows

Cosmos Diffusion 7B (7B · NVIDIA)
Mochi 1 Preview (10B · Genmo)
Wan2.2 TI2V 5B (5B · Wan-AI)

Frequently asked questions

How much VRAM does CogVideoX 5B need for video?

CogVideoX 5B (5B parameters) requires approximately 27.4 GB of VRAM at FP16 precision for generating 25 frames at 768×512. Video generation typically requires more VRAM than image generation because attention spans every frame of the clip rather than a single image.

Can I run CogVideoX 5B on RTX 4090?

CogVideoX 5B can run on the RTX 4090 with sequential offloading, though video generation will be significantly slower than when the whole model fits in VRAM.
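
A minimal sketch of what offloading looks like with diffusers, assuming the `accelerate` package is installed alongside `diffusers` and `torch`. The helper name `load_pipeline_with_offload` is illustrative. The offload and tiling methods are standard diffusers pipeline and VAE calls, but the actual VRAM saving on a given machine is not something this page measures.

```python
def load_pipeline_with_offload(model_id: str = "THUDM/CogVideoX-5b",
                               sequential: bool = True):
    """Build a CogVideoX pipeline that keeps most weights in system RAM."""
    # Imported lazily so the helper can be defined without a GPU stack installed.
    import torch
    from diffusers import CogVideoXPipeline

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    if sequential:
        # Streams submodules to the GPU one at a time: lowest VRAM, slowest.
        pipe.enable_sequential_cpu_offload()
    else:
        # Moves whole components (text encoder, transformer, VAE) on demand.
        pipe.enable_model_cpu_offload()
    # Decode the video latents in tiles to reduce peak VAE memory.
    pipe.vae.enable_tiling()
    return pipe
```

With offloading enabled, the pipeline manages device placement itself, so the explicit `pipe.to("cuda")` call is left out.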

How long does it take to generate a video with CogVideoX 5B?

On the reference GPU (RTX 4090 24GB), CogVideoX 5B generates a 25-frame video at 768×512 in approximately 4m 23s at FP16 with 30 inference steps. Faster GPUs with higher memory bandwidth will reduce generation time.

What resolution and frame count does CogVideoX 5B support?

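CogVideoX 5B supports up to 720×480 resolution and 49 frames per generation at 8 FPS. Higher resolutions and frame counts require more VRAM: in the FP16 estimates on this page, going from 25 to 100 frames at 768×512 raises the figure from 27.4 GB to 33.7 GB.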

About CogVideoX 5B

Use cases

video-generation, text-to-video

Recommended runtimes

comfyui, diffusers