Video Generation Models

Check if your GPU can run LTX Video, CogVideoX and other video generation models locally. Video models are VRAM-hungry — most require 16GB+ for basic generation.
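
As a rough first check before reading the cards below, the memory taken by the model weights alone can be estimated from the parameter count. The sketch below is an assumption-laden lower bound, not the method behind this page's VRAM figures: the helper names and the bytes-per-parameter table are hypothetical, and real usage is higher because activations, the text encoder, and the VAE also need memory.

```python
# Hypothetical helper: weights-only estimate, NOT the catalog's own VRAM figure.
# Assumes 1 GB = 10^9 bytes and that every parameter is stored in one dtype.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_gb(params_billion, dtype="fp16"):
    """Gigabytes needed to hold the weights alone (a lower bound on VRAM)."""
    return params_billion * BYTES_PER_PARAM[dtype]


def weights_fit(params_billion, vram_gb, dtype="fp16"):
    """True if the weights alone fit; says nothing about activations."""
    return weights_gb(params_billion, dtype) <= vram_gb


if __name__ == "__main__":
    for name, size in [("2B model", 2), ("14B model", 14)]:
        print(name, weights_gb(size), "GB in fp16;",
              "weights fit in 24 GB:", weights_fit(size, 24))
```

A model that fails this check will not run without quantization or offloading; a model that passes it may still run out of memory at high resolutions or frame counts.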

Alibaba · Wan Video 2.2 14B

14B params · up to 1280×720 · ~47 GB VRAM · 81 frames max

video-generation · text-to-video · image-to-video

Top tier

Updated version of Wan Video 2.1. Improved motion quality, better temporal coherence, and higher visual fidelity. 14B parameter 3D DiT architecture with Apache 2.0 license. Supports text-to-video and text+image-to-video generation.

16 fps · 3D-DIT

Lightricks · LTX-2 22B

22B params · up to 1920×1088 · ~54.4 GB VRAM · 241 frames max

video-generation · text-to-video · image-to-video

Top tier

LTX-2 (ltx-2.3-22b-dev) is Lightricks' DiT-based audio-video foundation model that generates synchronized video and audio in a single pass. It has ~22B parameters; a distilled variant runs at 8 steps with CFG=1 for fast generation. Width and height must each be divisible by 32, and the frame count must be a multiple of 8 plus 1 (for example, 241 = 8 × 30 + 1).

30 fps · 3D-DIT
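
The size rules stated for LTX-2 (width and height divisible by 32, frame count a multiple of 8 plus 1) can be checked before a job is submitted. The sketch below is illustrative only: the function names are hypothetical and are not part of any Lightricks API, and rounding down rather than up is an arbitrary choice made here.

```python
# Hypothetical helpers; names are illustrative and not part of any LTX-2 library.

def is_valid_ltx2_request(width, height, frames):
    """Check the rules quoted in the card: W and H % 32 == 0, frames = 8k + 1."""
    return width % 32 == 0 and height % 32 == 0 and frames % 8 == 1


def snap_ltx2_request(width, height, frames):
    """Round a request DOWN to the nearest shape that satisfies the rules."""
    w = max(32, (width // 32) * 32)
    h = max(32, (height // 32) * 32)
    f = max(1, ((frames - 1) // 8) * 8 + 1)
    return w, h, f


if __name__ == "__main__":
    print(is_valid_ltx2_request(1920, 1088, 241))  # the card's listed maximum
    print(snap_ltx2_request(1920, 1080, 240))      # a typical 1080p request
```

Rounding down keeps the request within the size the user asked for; a caller who prefers to preserve the full frame (as the listed 1920×1088 maximum does for 1080p) would round the height up instead.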

Lightricks · LTX Video 13B

13B params · up to 1280×720 · ~35.6 GB VRAM · 257 frames max

video-generation · text-to-video · image-to-video

Top tier

Highest quality LTX Video model at 13B parameters. Available in dev (best quality), distilled (faster), and FP8 (lower VRAM) variants. Produces high-fidelity video with strong temporal coherence.

24 fps · 3D-DIT

Alibaba · Wan Video 2.1 14B

14B params · up to 1280×720 · ~47 GB VRAM · 81 frames max

video-generation · text-to-video · image-to-video

Top tier

State-of-the-art open-source video generation model from Alibaba. 14B parameter 3D DiT with exceptional motion quality, temporal coherence, and visual fidelity. Supports text-to-video and image-to-video.

16 fps · 3D-DIT

Lightricks · LTX Video 2B

2B params · up to 1280×720 · ~13.6 GB VRAM · 161 frames max

video-generation · text-to-video · image-to-video

High

Lightweight 2B video generation model from Lightricks (v0.9.8). Available in dev, distilled, and FP8 variants. The distilled version generates video faster than real-time at lower resolutions. For best quality, use the 13B variant instead.

24 fps · 3D-DIT

Sand AI · MAGI-1

24B params · up to 1280×720 · ~57.6 GB VRAM · 120 frames max

video-generation · streaming-video · cinematic

High

24B autoregressive diffusion model for streaming video generation. Produces high-quality cinematic video with strong temporal coherence. Requires 80GB+ VRAM for full inference. Apache 2.0 licensed.

24 fps · 3D-DIT

SulphurAI · Sulphur 2

9B params · up to 1280×720 · ~28.4 GB VRAM · 241 frames max

video-generation · text-to-video · image-to-video

High

Sulphur 2 (base) is a 9B uncensored video generation model built on the LTX-2.3 architecture, supporting native text-to-video and image-to-video. Ships with a built-in prompt enhancer and quantized GGUF builds (~9.5GB Q8_0) for consumer hardware.

30 fps · 3D-DIT

Tencent · HunyuanVideo

13B params · up to 1280×720 · ~40.2 GB VRAM · 129 frames max

video-generation · text-to-video

High

Large-scale video generation model from Tencent. 13B parameter 3D DiT with Hunyuan-Large MLLM text encoder (~7B, not T5-based). Strong motion quality and visual fidelity up to 720p.

24 fps · 3D-DIT

BestWishYsh · Helios 14B

14B params · up to 1280×720 · ~37.6 GB VRAM · 240 frames max

video-generation · real-time-video · long-video

High

14B distilled video model achieving 19.5 FPS real-time generation. Based on Wan2.1-T2V-14B with pyramid distillation for minute-scale coherent video. Apache 2.0 licensed.

24 fps · 3D-DIT

Tencent · HunyuanVideo 1.5

8.3B params · up to 1280×720 · ~30.8 GB VRAM · 129 frames max

video-generation · text-to-video · image-to-video

High

Consumer-oriented successor to HunyuanVideo 13B from Tencent. 8.3B parameter 3D DiT supporting both text-to-video (T2V) and image-to-video (I2V). The step-distilled variant generates 480p video in about 75 s on an RTX 4090; with offloading, FP16 inference needs a minimum of about 14GB VRAM.

24 fps · 3D-DIT

Skywork · SkyReels V2 14B

14B params · up to 1280×720 · ~37.6 GB VRAM · 121 frames max

video-generation · long-video · infinite-length

High

14B diffusion-forcing video model supporting infinite-length video generation at 720p. Uses an autoregressive diffusion-forcing architecture for seamless long-form video without quality degradation.

24 fps · 3D-DIT

Genmo · Mochi 1 Preview

10B params · up to 848×480 · ~29.6 GB VRAM · 84 frames max

video-generation · text-to-video · cinematic

High

10B parameter video generation model from Genmo using AsymmDiT architecture with T5-XXL text encoder. Generates 848x480 videos at 30fps with strong motion quality. Apache 2.0 licensed.

30 fps · 3D-DIT

Wan-AI · Wan2.2 TI2V 5B

5B params · up to 832×480 · ~19.6 GB VRAM · 81 frames max

video-generation · text-image-to-video · accessible

High

5B text+image-to-video model from the Wan 2.2 family. Runs on consumer GPUs with 8GB+ VRAM. Takes a text prompt and a reference image as input and generates coherent video clips.

16 fps · 3D-DIT

THUDM · CogVideoX 5B

5B params · up to 720×480 · ~19.6 GB VRAM · 49 frames max

video-generation · text-to-video

High

Open-source video generation model from Tsinghua University. 3D full-attention transformer with expert adaptive LayerNorm. Generates 6-second clips at 8fps.

8 fps · 3D-DIT

NVIDIA · Cosmos Diffusion 7B

7B params · up to 1024×576 · ~23.6 GB VRAM · 57 frames max

video-generation · text-to-video · world-model

High

7B diffusion model from NVIDIA's Cosmos platform for physical AI and world modeling. Generates physically plausible videos from text descriptions. Part of NVIDIA's Physical AI initiative.

24 fps · DIT

lllyasviel · FramePack I2V

13B params · up to 1280×720 · ~40.2 GB VRAM · 129 frames max

video-generation · image-to-video · low-vram

High

Viral low-VRAM video generation model based on HunyuanVideo architecture. Uses a novel next-frame prediction approach that inverts the diffusion process to pack future frames into the noise of the current frame, enabling video generation with only 6GB VRAM. Image-to-video with strong motion quality.

30 fps · 3D-DIT

Alibaba · Wan Video 2.1 1.3B

1.3B params · up to 832×480 · ~21.6 GB VRAM · 81 frames max

video-generation · text-to-video · fast-generation

High

Lightweight video generation model from Alibaba. Only 1.3B params — runs on consumer GPUs with 8GB+ VRAM. Good quality for its size, excellent for rapid iteration.

16 fps · 3D-DIT

THUDM · CogVideoX 2B

2B params · up to 720×480 · ~13.6 GB VRAM · 49 frames max

video-generation · text-to-video · accessible

Mid

Lightweight 2B video generation model from Tsinghua University. Most accessible CogVideoX variant, runs on 8GB+ VRAM with quantization. Generates 6-second clips at 8fps. Apache 2.0 licensed.

8 fps · 3D-DIT

guoyww · AnimateDiff v1.5.3

0.4B params · up to 512×512 · ~1.2 GB VRAM · 16 frames max

video-generation · animation · motion

Mid

Motion adapter module that plugs into any SD 1.5 checkpoint to generate short animated clips. Only 0.4B extra parameters on top of the SD 1.5 base model. Generates 16 frames at 8fps (2-second clips).

8 fps · UNET
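
The "frames max" and "fps" values on each card together determine the longest clip a model produces in one pass: duration is frame count divided by frame rate. A minimal sketch of that arithmetic follows, using figures from the cards above; the function name is hypothetical.

```python
# Hypothetical helper: longest single-pass clip implied by a card's figures.

def clip_seconds(frames, fps):
    """Clip duration in seconds for a given frame count and frame rate."""
    return frames / fps


if __name__ == "__main__":
    # (model, frames max, fps) taken from the listing above
    cards = [
        ("AnimateDiff v1.5.3", 16, 8),
        ("CogVideoX 5B", 49, 8),
        ("Wan Video 2.1 14B", 81, 16),
    ]
    for name, frames, fps in cards:
        print(f"{name}: {clip_seconds(frames, fps):.2f} s")
```

This reproduces the 2-second figure quoted for AnimateDiff (16 frames at 8 fps) and the roughly 6-second figure quoted for CogVideoX (49 frames at 8 fps).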