
DeepSeek V4 Pro

Name: DeepSeek V4 Pro
Author: DeepSeek

Frontier

1.3M downloads · 5.2K likes · Released Apr 2026 · 1.0M-token context · MIT license · Quality 100 (Exceptional)

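DeepSeek V4 Pro (1600B parameters) requires approximately 865.4 GB of VRAM with NVFP4 quantization (about 862 GB of weights plus KV cache and runtime overhead). As a Mixture of Experts model it activates only 49B parameters per token, which keeps per-token compute and memory traffic low, but all 1600B parameters must still be resident in memory; the savings in footprint come from the 4-bit weights, not from the MoE design. For the best balance of quality and speed, we recommend hardware with at least 996 GB of VRAM.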
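A back-of-the-envelope sketch of why MoE saves compute rather than memory: resident weight memory scales with total parameters, while the bytes read for each decoded token scale with active parameters. The helper name `weights_gb` and the flat 4 bits per weight are illustrative assumptions; real NVFP4 checkpoints also store scale factors and keep some tensors in FP8, which is why the actual footprint is ~862 GB rather than 800 GB.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Hypothetical helper: decimal GB needed to store PARAMS_BILLIONS parameters
# at BITS_PER_WEIGHT bits each (billions x bits / 8 = GB).
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# All 1600B parameters must be resident, even though only 49B are active.
echo "resident weights: $(weights_gb 1600 4) GB"   # 800.0
# Only the active experts are read for each decoded token.
echo "read per token:   $(weights_gb 49 4) GB"     # 24.5
```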

Get started

Copy-paste command to run DeepSeek V4 Pro with llama.cpp in Docker on a GPU server large enough to hold the weights.

Run

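# CUDA image + --gpus all (needs the NVIDIA Container Toolkit) so -ngl can offload layers;
# the "full" images need --run to start llama-cli.
docker run --rm -it --gpus all ghcr.io/ggml-org/llama.cpp:full-cuda --run \
  --hf-repo "deepseek-ai/DeepSeek-V4-Pro" \
  --hf-file "DeepSeek-V4-Pro-NVFP4.gguf" \
  -c 4096 -ngl 99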
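Before pulling several hundred GB of weights, it is worth checking whether the machine's combined GPU memory even approaches the ~865 GB NVFP4 requirement. A minimal sketch, assuming NVIDIA GPUs with `nvidia-smi` on the PATH; the helper names `total_vram_gb` and `fits` are hypothetical. `nvidia-smi` reports `memory.total` in MiB, so the sum is converted to decimal GB to match the figures on this page.

```shell
# total_vram_gb: read one MiB value per line on stdin, print the sum in decimal GB.
total_vram_gb() {
  awk '{ s += $1 } END { printf "%.1f\n", s * 1048576 / 1e9 }'
}

# fits HAVE_GB NEED_GB: exit status 0 if HAVE_GB >= NEED_GB.
fits() {
  awk -v have="$1" -v need="$2" 'BEGIN { exit !(have >= need) }'
}

need_gb=865.4   # NVFP4 estimate from this page (weights + KV cache + runtime)

if command -v nvidia-smi >/dev/null 2>&1; then
  have_gb=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | total_vram_gb)
  if fits "$have_gb" "$need_gb"; then
    echo "OK: ${have_gb} GB of GPU memory >= ${need_gb} GB"
  else
    echo "Insufficient: ${have_gb} GB of GPU memory < ${need_gb} GB"
  fi
fi
```

For example, eight GPUs reporting 81920 MiB (80 GiB) each sum to about 687.2 GB, which falls short of the requirement.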

Quick specs

Parameters: 1600B (49B active)

Architecture: Mixture of Experts (MoE)

Context: 1.0M tokens

Modality: text

Min RAM: 624 GB

Rec. RAM: 896 GB (NVFP4)

License: MIT

Family: DeepSeek

✓ Code ✓ Reasoning

About this model

DeepSeek V4 Pro is a 1.6T-parameter sparse MoE (49B active, 384 routed + 1 shared expert) built for million-token agentic reasoning. Experts ship natively in FP4, so the real on-disk footprint is roughly 862 GB (FP4 experts + FP8 attention) rather than the ~3.2 TB an FP16 copy would need. It is still a server/workstation deployment: realistic local use needs roughly 1 TB of aggregate memory, either 1 TB+ of unified memory or a multi-GPU server (8x 80GB GPUs provide only 640 GB, so part of the weights would have to be offloaded), and at long Think Max contexts the KV cache becomes the dominant variable cost.
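The point about long contexts follows from the KV cache growing linearly with the number of tokens held, while the weights stay fixed. A minimal sketch; the per-token KV size below is a purely hypothetical placeholder, since this page does not state DeepSeek V4 Pro's attention layout or cache size per token. Substitute the real figure for your runtime.

```shell
# kv_cache_gb TOKENS KV_BYTES_PER_TOKEN
# KV cache grows linearly with context: tokens x bytes per token, in decimal GB.
kv_cache_gb() {
  awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f\n", t * b / 1e9 }'
}

kv_per_token=100000   # HYPOTHETICAL: 100 KB of cache per token, for illustration only

for ctx in 4096 131072 1000000; do
  echo "context ${ctx}: $(kv_cache_gb "$ctx" "$kv_per_token") GB of KV cache"
done
```

Whatever the true per-token size, a 1M-token context needs roughly 244 times the cache of a 4096-token context.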

• 1.6T total / 49B active sparse MoE: 384 routed + 1 shared expert
• Native FP4 experts: ~862 GB on disk, not the ~3.2 TB of an FP16 copy
• 1M-token context for million-token agent workflows
• Server/workstation class: use distills or the Flash variant for local use


Inference speed

DeepSeek V4 Pro inference speed — tokens per second by GPU & Mac

Estimated decode speed (tokens/sec) for DeepSeek V4 Pro at NVFP4 across popular GPUs and Apple Silicon, including multi-GPU rigs, using the fastest local runtime per device. None of the listed configurations has enough memory to hold the weights, and all of them show the same ~2 tok/s estimate. Speed is memory-bandwidth bound, so hardware that fits the whole model in fast memory runs far faster than hardware that offloads weights to system RAM.
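Because decoding is memory-bandwidth bound, a rough upper bound on single-stream speed is the memory bandwidth divided by the bytes read per token, which for a MoE model is set by the 49B active parameters rather than the full 1600B. A minimal sketch, assuming ~4.5 effective bits per weight for NVFP4 (4-bit values plus one FP8 scale per 16-value block) and ignoring KV cache reads, attention compute, and interconnect overhead; the bandwidth figures are hypothetical inputs, not measurements of any listed device.

```shell
# decode_tps BANDWIDTH_GBPS ACTIVE_PARAMS_BILLIONS BITS_PER_WEIGHT
# Upper bound on decode tokens/sec: bandwidth / GB of active weights read per token.
decode_tps() {
  awk -v bw="$1" -v p="$2" -v b="$3" 'BEGIN {
    bytes_per_token_gb = p * b / 8          # GB of weights touched per token
    printf "%.1f\n", bw / bytes_per_token_gb
  }'
}

# Hypothetical bandwidths: fast accelerator memory vs. much slower system RAM.
echo "weights read at 1000 GB/s: $(decode_tps 1000 49 4.5) tok/s max"   # 36.3
echo "weights read at 100 GB/s:  $(decode_tps 100 49 4.5) tok/s max"    # 3.6
```

When weights are split across memory tiers, every token still has to read its share from the slowest tier, so that tier tends to set the pace.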

GPU / Mac	Memory	Quant	Speed (tok/s)	Fits?
RTX 5090 32GB	32 GB	NVFP4	2.0	Too big
RTX 4090 24GB	24 GB	NVFP4	2.0	Too big
RTX 4080 Super 16GB	16 GB	NVFP4	2.0	Too big
RTX 3090 24GB	24 GB	NVFP4	2.0	Too big
RTX 4070 12GB	12 GB	NVFP4	2.0	Too big
RTX 3060 12GB	12 GB	NVFP4	2.0	Too big
RTX 4060 8GB	8 GB	NVFP4	2.0	Too big
RX 7900 XTX 24GB	24 GB	NVFP4	2.0	Too big
MacBook Pro M4 Max 128GB	128 GB	NVFP4	2.0	Too big
Mac Studio M3 Ultra 256GB	256 GB	NVFP4	2.0	Too big
Mac Studio M2 Ultra 128GB	128 GB	NVFP4	2.0	Too big
Mac Studio M1 Ultra 128GB	128 GB	NVFP4	2.0	Too big
MacBook Pro M4 Max 64GB	64 GB	NVFP4	2.0	Too big
MacBook Pro M3 Max 64GB	64 GB	NVFP4	2.0	Too big
MacBook Pro M1 Max 64GB	64 GB	NVFP4	2.0	Too big
MacBook Pro M4 Pro 48GB	48 GB	NVFP4	2.0	Too big
2× RTX 4090 24GB	48 GB	NVFP4	2.0	Too big
2× RTX 3090 24GB	48 GB	NVFP4	2.0	Too big
2× RX 7900 XTX 24GB	48 GB	NVFP4	2.0	Too big
4× RTX 3060 12GB	48 GB	NVFP4	2.0	Too big

Estimates for single-stream decoding at NVFP4; real tokens/sec varies with prompt length, context, batch size, and runtime build. Prompt processing (prefill) is faster than the decode figures shown here.

Quantization

DeepSeek V4 Pro quantization — VRAM & quality by quant level

How much VRAM DeepSeek V4 Pro (1600B) needs at each quantization level, and whether it fits a 24 GB card (RTX 4090 / 3090). The recommended NVFP4 uses ~896 GB, about 48% less VRAM than Q8_0, at a small quality cost.

Quant	Bits	VRAM (weights)	Quality	Fits 24 GB?
Q2_K	2	624 GB	Low	Too big
Q3_K_S	3	784 GB	Low	Too big
NVFP4 (recommended)	4	896 GB	Medium	Too big
Q4_K_M	4	976 GB	Medium	Too big
Q5_K_M	5	1152 GB	High	Too big
Q6_K	6	1312 GB	High	Too big
Q8_0	8	1712 GB	Very High	Too big
F16	16	3280 GB	Maximum	Too big

VRAM shown is quantized weights only; add ~1–3 GB of runtime overhead plus the KV cache for your context length. Lower quants trade quality for memory: among the GGUF K-quants, Q4_K_M is the usual sweet spot, and Q2/Q3 make sense only when nothing larger fits.
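The table's weight sizes follow from parameters × effective bits per weight ÷ 8. Running that backwards shows the effective bits each row implies, which sit slightly above the nominal Bits column because quantized formats store per-block scales (and K-quant mixes such as Q4_K_M keep some tensors at higher precision). A minimal sketch; the helper name `effective_bpw` is hypothetical.

```shell
# effective_bpw SIZE_GB PARAMS_BILLIONS: bits per weight implied by a weight size.
effective_bpw() {
  awk -v gb="$1" -v p="$2" 'BEGIN { printf "%.2f\n", gb * 8 / p }'
}

# Sizes (GB) copied from the quantization table; 1600B parameters.
for row in "Q2_K 624" "NVFP4 896" "Q4_K_M 976" "Q8_0 1712"; do
  set -- $row
  echo "$1: $(effective_bpw "$2" 1600) bits/weight"   # 3.12, 4.48, 4.88, 8.56
done
```

By the same arithmetic, the ~862 GB native checkpoint corresponds to about 4.31 bits per weight.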

Quality benchmarks

DeepSeek V4 Pro benchmark scores


Coding

SWE-bench Verified: 80.6%

HumanEval+: not reported

Aider Polyglot: not reported

LiveCodeBench: 93.5%

Reasoning

MMLU-Pro: 87.5%

GPQA Diamond: not reported

MATH-500: not reported

ARC Challenge: not reported

Source: vendor-reported · 2026-04-24

Hardware compatibility

Fit estimates across all hardware


Memory breakdown

Reference: RTX 2060 6GB

Weights: 862.0 GB

KV cache: 1.9 GB

Runtime: 0.9 GB

Headroom: 0.6 GB
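The four components in this breakdown add up to the 865.4 GB NVFP4 figure quoted at the top of the page. A minimal sketch that totals a memory budget from its parts; the helper name `mem_total` is hypothetical, and the values are the reference breakdown shown here (the KV cache term grows with longer contexts).

```shell
# mem_total COMPONENT_GB...: sum any number of memory components (decimal GB).
mem_total() {
  awk 'BEGIN { for (i = 1; i < ARGC; i++) s += ARGV[i]; printf "%.1f\n", s }' "$@"
}

# weights, KV cache, runtime, headroom (from the memory breakdown)
mem_total 862.0 1.9 0.9 0.6   # prints 865.4
```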

Frequently asked questions

FAQ — DeepSeek V4 Pro

How much VRAM does DeepSeek V4 Pro need?

DeepSeek V4 Pro (1600B parameters) requires approximately 865.4 GB of VRAM with NVFP4 quantization. Lower-bit quantizations such as Q3_K_S (~784 GB of weights) or Q2_K (~624 GB) use less memory but reduce quality; Q4_K_M (~976 GB) actually needs more memory than NVFP4.

What is the best quantization for DeepSeek V4 Pro?

The recommended quantization for DeepSeek V4 Pro is NVFP4, which offers the best balance between model quality and memory efficiency. Higher-bit quantizations such as Q6_K (~1312 GB) and Q8_0 (~1712 GB) preserve more quality but require considerably more VRAM.

Is DeepSeek V4 Pro good for reasoning?

Yes. DeepSeek V4 Pro was designed for reasoning as well as agentic, coding, and long-context workloads; vendor-reported scores include 87.5% on MMLU-Pro and 93.5% on LiveCodeBench.