Will It Run AI

vLLM Multi-GPU Setup: How Many GPUs You Actually Need (2026)

Stop guessing --tensor-parallel-size. Exact GPU counts + VRAM per model: Llama 70B = 2×RTX 4090 or 1×H100. Copy-paste configs and a full sizing table.

vLLM is the standard for production LLM serving: continuous batching, PagedAttention, and native tensor parallelism across multiple GPUs — all behind a single flag. This reference covers how to set --tensor-parallel-size, how much total VRAM each model needs, and copy-paste configs for common setups.


Quick Answer: Tensor Parallel Size + VRAM by Model

The most common question is: how do I set --tensor-parallel-size and how many GPUs do I need?

Set --tensor-parallel-size to the number of GPUs you want to use. The model must fit in the combined VRAM of all GPUs. Use this table to pick the right count:

| Model | Min VRAM (Q4) | --tensor-parallel-size (RTX 4090 24GB) | --tensor-parallel-size (H100 80GB) |
| --- | --- | --- | --- |
| Llama 3.1 8B | ~6 GB | 1 | 1 |
| Llama 3.1 70B | ~42 GB | 2 (48 GB) | 1 |
| Qwen 3.5 35B-A3B | ~21 GB | 1 | 1 |
| Qwen 3.5 122B-A10B | ~70 GB | 4 (96 GB) | 1 |
| Mixtral 8x22B | ~85 GB | 4 (96 GB) | 2 |
| DeepSeek R1 70B | ~42 GB | 2 (48 GB) | 1 |
| DeepSeek R1 671B | ~380 GB | N/A (16×) | 8 |
| Llama 3.1 405B | ~230 GB | N/A | 4 |

Rule of thumb: --tensor-parallel-size must divide evenly into the model's attention head count. Powers of 2 (1, 2, 4, and 8) work for nearly all models. Other values (3, 5, 6) often fail.
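The divisibility rule can be checked before launching anything. The sketch below is a minimal illustration, not vLLM's own validation logic: the helper name valid_tp_sizes is hypothetical, and the 64-head figure is an assumed example value you would read from your model's config. Real checkpoints can impose further constraints beyond head count.

```python
def valid_tp_sizes(num_attention_heads: int, gpu_count: int) -> list[int]:
    """Return every tensor-parallel size up to gpu_count that divides the head count."""
    return [
        tp for tp in range(1, gpu_count + 1)
        if num_attention_heads % tp == 0
    ]


# Assumed example: a model with 64 attention heads on an 8-GPU node.
print(valid_tp_sizes(64, 8))  # [1, 2, 4, 8]
```

Sizes such as 3, 5, and 6 drop out because 64 is not divisible by them, which matches the rule of thumb above.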

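Copy-paste command for 2× RTX 4090 serving Llama 70B. The unquantized checkpoint named below needs roughly 140 GB in bfloat16, so on 48 GB substitute a 4-bit AWQ or GPTQ build of the model and add the matching --quantization flag: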

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192

Copy-paste command for 4× GPU serving Qwen 3.5 122B:

vllm serve Qwen/Qwen3.5-122B-A10B-Instruct-GPTQ-Int4 \
  --tensor-parallel-size 4 \
  --quantization gptq \
  --max-model-len 32768

Use the VRAM calculator to verify your specific model + quantization before provisioning hardware.


What is vLLM?

vLLM is a high-throughput LLM inference and serving engine developed at UC Berkeley. It was built to address a fundamental inefficiency in how transformer models manage memory during inference: the KV cache — the attention keys and values accumulated as tokens are generated — was being allocated statically and wastefully in most serving systems.

The core innovation is PagedAttention, which manages KV cache memory the same way an operating system manages virtual memory: in fixed-size pages, allocated dynamically, shared across requests where possible. This largely eliminates KV cache memory fragmentation and allows vLLM to pack far more concurrent requests into the same GPU memory budget.
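The paging idea can be illustrated with a toy allocator. This is a sketch of the concept only, not vLLM's implementation: the class PagedKVCache and its methods are hypothetical, the block size of 16 tokens is an assumed value, and real block sharing across requests is left out.

```python
class PagedKVCache:
    """Toy block table: maps each sequence to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("request-a")
print(len(cache.block_tables["request-a"]))  # 20 tokens occupy 2 blocks
cache.release("request-a")
print(len(cache.free_blocks))  # all 8 blocks are free again
```

Because memory is claimed one block at a time, a request that stops early never holds space reserved for tokens it did not generate.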

On top of PagedAttention, vLLM implements continuous batching (also called in-flight batching). Rather than waiting for an entire batch to finish before starting new requests, it continuously slots new requests into the batch as others complete. The result is dramatically higher GPU utilization compared to static batching, especially for workloads with variable output lengths.
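A small simulation makes the difference concrete. The two functions below are hypothetical helpers that count decode steps under a simplified model (one token per running request per step, no prefill cost); they are not taken from vLLM's scheduler.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """A freed slot is refilled from the queue on the very next step."""
    waiting = deque(output_lengths)
    running: list[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [2, 8, 2, 8]  # tokens each request will generate (illustrative values)
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

The saving comes entirely from the short requests: under static batching their slots sit idle until the long request in the same batch completes.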

For multi-GPU setups specifically, vLLM's tensor parallelism is first-class, not an afterthought. A single flag, --tensor-parallel-size, splits the model across N GPUs. The communication primitives use NCCL under the hood, so collectives run over NVLink on NVLink-capable hardware and over PCIe where that is all that is available.


When You Need Multi-GPU vLLM

The decision comes down to two questions: does the model fit on one GPU, and do you need higher throughput than one GPU can deliver?

VRAM requirements drive the first question. If the model does not fit, multi-GPU is not optional. Here is where common models land against single-card capacity limits:

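| Model | Min VRAM (Q4) | Min GPUs (H100 80GB) | Min GPUs (RTX 4090 24GB) |
| --- | --- | --- | --- |
| Llama 3.1 8B | ~6 GB | 1 | 1 |
| Llama 3.1 70B | ~42 GB | 1 | 2 |
| Mixtral 8x22B | ~85 GB | 2 | 4 |
| Llama 3.1 405B | ~230 GB | 4 | N/A |
| DeepSeek R1 (671B) | ~380 GB | 8 | N/A |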

Llama 3.1 70B sits in an interesting position: it fits on a single H100 80GB at Q4, but requires two RTX 4090s in the consumer segment. Llama 3.1 405B requires 4 H100s minimum; there is no consumer path. DeepSeek R1 at 671B parameters is firmly in the 8-GPU datacenter tier.
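The GPU counts in this section follow from simple arithmetic: take the smallest power-of-two count whose combined VRAM covers the model. The helper below is a hypothetical sketch of that calculation. It counts weights only, as the table does, and leaves no allowance for KV cache or runtime overhead, so treat its answer as a lower bound.

```python
from typing import Optional


def min_gpu_count(model_vram_gb: float, gpu_vram_gb: float,
                  max_gpus: int = 16) -> Optional[int]:
    """Smallest power-of-two GPU count whose combined VRAM holds the model."""
    count = 1
    while count <= max_gpus:
        if count * gpu_vram_gb >= model_vram_gb:
            return count
        count *= 2
    return None  # does not fit within max_gpus


# Figures from the table: approximate Q4 footprint in GB vs. per-card VRAM.
print(min_gpu_count(42, 24))   # Llama 3.1 70B on RTX 4090
print(min_gpu_count(230, 80))  # Llama 3.1 405B on H100
```

Passing max_gpus=8 models a single node; under that limit the 380 GB DeepSeek R1 footprint returns None for 24 GB cards.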

Throughput is the second driver. Even when a model fits on one GPU, adding GPUs can raise aggregate tokens per second for batch serving: each card holds a smaller share of the weights, which leaves more VRAM for KV cache and therefore room for more concurrent sequences. If you are running an API that serves multiple concurrent users, multi-GPU vLLM will handle the load more efficiently than a single GPU.

Use the VRAM calculator to check whether your specific model and quantization fit your hardware before setting up multi-GPU serving.


Setup Guide

Installation

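This guide assumes Python 3.9+ and CUDA 11.8 or later. The supported range shifts between vLLM releases, so check the installation docs for the version you are installing. The standard install: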

pip install vllm

For a CUDA 12.4 environment (suitable for Hopper; Blackwell GPUs need a CUDA 12.8 or newer build instead):

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu124

Verify the installation and GPU detection:

python -c "import vllm; print(vllm.__version__)"
nvidia-smi  # Confirm all target GPUs are visible

Basic Multi-GPU Serving

The minimal command to serve a 70B model across two GPUs (at float16 the weights alone are about 140 GB, so this assumes two 80 GB cards):

vllm serve meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2 \
  --dtype float16 \
  --max-model-len 8192

This starts an OpenAI-compatible API server on port 8000. The --tensor-parallel-size 2 flag is the only change from single-GPU serving — vLLM handles everything else automatically.
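Once the server is up, any OpenAI-style client can talk to it. The sketch below uses only the standard library and assumes the server is reachable at http://localhost:8000 and exposes the /v1/completions route; the function names are hypothetical, and max_tokens=64 is an arbitrary illustrative value.

```python
import json
import urllib.request


def build_completion_request(prompt: str, model: str,
                             base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str) -> str:
    """Send the request and return the generated text; needs a running server."""
    with urllib.request.urlopen(build_completion_request(prompt, model)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from the command above running, complete("The capital of France is", "meta-llama/Llama-3.1-70B") returns the generated continuation. The model field should match the name passed to vllm serve.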

For gated models, set your Hugging Face token first:

export HF_TOKEN=your_token_here
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 8192

Key Parameters Explained

Understanding these flags is essential for tuning performance and avoiding OOM errors:

  • --tensor-parallel-size N (alias: -tp N): Splits the model across N GPUs. Must divide evenly into the number of attention heads — most models support 1, 2, 4, or 8. Set this to match the number of GPUs you want to use.

  • --dtype float16 or --dtype bfloat16: Inference precision. Use bfloat16 for Ampere (A100, RTX 30xx) and newer hardware. It has a wider dynamic range than float16 and avoids certain numerical instability issues in long-context generation. Use float16 for Turing (RTX 20xx, T4) and older, which lack bfloat16 support.

  • --max-model-len: Maximum context length for the model. This directly determines KV cache memory allocation. If you get OOM errors, reduce this first before reducing --gpu-memory-utilization. Cutting from 128K to 8192 can free substantial VRAM.

  • --gpu-memory-utilization 0.90: Fraction of each GPU's VRAM that vLLM may use for model weights, activations, and KV cache (default: 0.90). The remaining 10% is left for the CUDA context and any other processes on the GPU. Lower this if you are seeing OOM errors after initial model load.

  • --quantization awq or --quantization gptq: Load a quantized checkpoint. AWQ and GPTQ are the primary quantization formats vLLM supports natively. GGUF is only partially supported — for GGUF-first workflows, llama.cpp remains the better tool.
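To see why --max-model-len has such a large effect, it helps to estimate KV cache size per token. The sketch below uses the standard formula (keys plus values, for every layer and KV head). The function name is hypothetical, and the shape values (80 layers, 8 KV heads, head dimension 128, 2-byte elements) are assumptions chosen to resemble a Llama-3.1-70B-class model; read the real values from your model's config.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Keys and values (the factor of 2) for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3.1-70B-style shape: 80 layers, 8 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
for context_len in (8192, 131072):
    gib = per_token * context_len / 2**30
    print(f"{context_len:>6} tokens -> {gib:.1f} GiB per sequence")
```

Under these assumptions a single full-length sequence costs 2.5 GiB at 8192 tokens and 40 GiB at 128K, which is why trimming the context limit is usually the first thing to try.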

Advanced: Pipeline Parallelism

For models that do not fit across the GPUs on a single node, or when scaling to very large cluster sizes, vLLM supports pipeline parallelism in addition to tensor parallelism:

vllm serve meta-llama/Llama-3.1-405B \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --dtype bfloat16

Total GPUs consumed = tensor_parallel_size × pipeline_parallel_size. The command above uses 8 GPUs total: 4-way tensor parallelism across 2 pipeline stages.

Pipeline parallelism splits the model's layers across stages. Stage 0 processes the first half of the transformer layers, stage 1 processes the second half. This introduces pipeline bubble overhead but enables multi-node deployments where each node handles one or more pipeline stages.
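The layer split and the GPU count can be sketched in a few lines. Both helpers are hypothetical illustrations; vLLM's actual partitioning may assign layers differently, and the 126-layer figure is an assumed value for a 405B-class model.

```python
def partition_layers(num_layers: int, num_stages: int) -> list[range]:
    """Split layer indices into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages


def total_gpus(tensor_parallel_size: int, pipeline_parallel_size: int) -> int:
    """Each pipeline stage is itself sharded across tensor_parallel_size GPUs."""
    return tensor_parallel_size * pipeline_parallel_size


print(partition_layers(126, 2))  # [range(0, 63), range(63, 126)]
print(total_gpus(4, 2))          # 8
```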

For single-node setups with 8 or fewer GPUs, pure tensor parallelism (--tensor-parallel-size 8) is almost always preferred over a hybrid approach. Pipeline parallelism becomes relevant primarily when a single node cannot hold all model shards.

Controlling GPU Visibility

When your server has more GPUs than you want vLLM to use, set CUDA_VISIBLE_DEVICES to restrict which devices are exposed:

# Use only GPUs 0 and 1 out of a 4-GPU machine
CUDA_VISIBLE_DEVICES=0,1 vllm serve meta-llama/Llama-3.1-70B \
  --tensor-parallel-size 2 \
  --dtype bfloat16

This is the cleanest way to run multiple vLLM instances on the same machine, each serving a different model on different GPUs. Give each instance its own --port so they do not collide on the default 8000.
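When several instances share a machine, it can help to generate each instance's environment and command in one place. The sketch below only builds the pieces and does not start anything. The helper name instance_launch_plan and the model names are placeholders, and it assumes each instance is given its own port through the --port flag.

```python
import os


def instance_launch_plan(model: str, gpu_ids: list[int], port: int) -> tuple[dict, list[str]]:
    """Return (environment, command) for one vLLM instance pinned to gpu_ids."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    command = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--port", str(port),
    ]
    return env, command


# Placeholder model names: two instances on a 4-GPU machine.
plans = [
    instance_launch_plan("model-a", [0, 1], port=8000),
    instance_launch_plan("model-b", [2, 3], port=8001),
]
for env, command in plans:
    print(env["CUDA_VISIBLE_DEVICES"], " ".join(command))
```

Each pair can then be handed to subprocess.Popen(command, env=env). Deriving the tensor-parallel size from the GPU list keeps the two values from drifting apart.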


Performance: NVLink vs PCIe

This is the practical performance ceiling that determines whether multi-GPU vLLM is worth your investment.

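Tensor parallelism requires inter-GPU communication inside every transformer layer during inference: the partial results held by each GPU are combined with an all-reduce after the attention block and again after the MLP block. Each generated token therefore triggers all-reduce operations across all participating GPUs, and the bandwidth and latency available for that communication determine how much overhead you pay.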
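A rough sense of the data involved comes from multiplying out the activation sizes. The sketch below is a hypothetical back-of-the-envelope helper. It assumes two all-reduces per layer (a common tensor-parallel layout), each carrying one hidden-state vector per token, and uses an assumed 70B-class shape of 80 layers and hidden size 8192. It reports the logical payload, not the exact bytes a given all-reduce algorithm puts on the wire.

```python
def allreduce_bytes_per_token(num_layers: int, hidden_size: int,
                              dtype_bytes: int = 2,
                              allreduces_per_layer: int = 2) -> int:
    """Logical all-reduce payload for one token passing through the whole model."""
    return num_layers * allreduces_per_layer * hidden_size * dtype_bytes


# Assumed 70B-class shape: 80 layers, hidden size 8192, 2-byte activations.
payload = allreduce_bytes_per_token(num_layers=80, hidden_size=8192)
print(f"{payload / 2**20:.1f} MiB per token, across {80 * 2} separate all-reduces")
```

The payload scales with the number of tokens processed in a step, so large batches and long prompts multiply both the volume and the cost of a slow interconnect.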

NVLink setups (H100 SXM, A100 SXM, B200 SXM) achieve roughly 0.92x scaling efficiency per card. Two H100 SXM cards running a 70B model deliver approximately 1.84x the throughput of a single H100 serving the same model. The H100 SXM provides 900 GB/s NVLink 4.0 bandwidth — fast enough that synchronization overhead is a small tax.

PCIe setups (RTX 4090, L40S, A100 PCIe) achieve roughly 0.70–0.78x scaling efficiency per card. Two RTX 4090s deliver approximately 1.4–1.6x the throughput of one RTX 4090, not 2x. PCIe Gen4 x16 provides ~32 GB/s in each direction (~64 GB/s bidirectional), roughly 14x less than the 900 GB/s bidirectional figure for NVLink 4.0. Every all-reduce operation stalls waiting for inter-GPU communication to complete.

The bandwidth gap in context:

| Interconnect | Bandwidth | Scaling (2 GPUs) | Overhead |
| --- | --- | --- | --- |
| NVLink 5.0 (B200/B100) | 1,800 GB/s | ~0.93x | 7% |
| NVLink 4.0 (H100 SXM) | 900 GB/s | ~0.92x | 8% |
| AMD Infinity Fabric (MI300X) | 896 GB/s | ~0.88x | 12% |
| NVLink 3.0 (A100 SXM) | 600 GB/s | ~0.90x | 10% |
| PCIe Gen4 x16 | ~32 GB/s | ~0.75x | 25% |

The practical takeaway: PCIe multi-GPU with vLLM is still 3–5x faster than CPU offloading for large models. If you own two RTX 4090s and need to run a 70B model, multi-GPU vLLM is absolutely the right choice. Just do not expect near-linear scaling — budget for ~1.4x throughput, not ~2x.
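The per-card efficiency figures translate into expected speedup with one multiplication. The helper below is a hypothetical sketch that applies the guide's own efficiency estimates; those numbers are approximations, not measurements you should expect to reproduce exactly.

```python
def expected_speedup(gpu_count: int, per_card_efficiency: float) -> float:
    """Throughput relative to one GPU, given per-card scaling efficiency."""
    return gpu_count * per_card_efficiency


# Per-card efficiencies quoted in this guide for two-GPU setups.
efficiencies = {
    "NVLink 4.0 (H100 SXM)": 0.92,
    "PCIe Gen4 x16 (low end)": 0.70,
}
for interconnect, efficiency in efficiencies.items():
    print(f"{interconnect}: {expected_speedup(2, efficiency):.2f}x of one GPU")
```

This reproduces the ~1.84x and ~1.4x planning figures for NVLink and PCIe pairs.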

If you are evaluating a hardware purchase specifically for multi-GPU serving, the NVLink premium on SXM form-factor GPUs pays back quickly in efficiency. The A100 80GB SXM in a multi-GPU configuration consistently outperforms its PCIe counterpart even though both carry the same amount of VRAM.


Real-World Configurations

Aggregate throughput estimates below reflect serving workloads with mixed prompt/generation lengths and concurrent requests. Your numbers will vary based on context length, batch concurrency, and model architecture.

| Setup | Total VRAM | Good For | Est. Throughput |
| --- | --- | --- | --- |
| 2x RTX 4090 | 48 GB | 70B models (Q4), 13B at FP16 | ~25–30 tok/s |
| 2x H100 SXM | 160 GB | 70B (FP16), Mixtral 8x22B (Q4) | ~80–100 tok/s |
| 4x A100 80GB SXM | 320 GB | Up to 405B at Q4, 70B at FP16 | ~60–80 tok/s |
| 8x H100 SXM | 640 GB | Up to 671B at Q4 (405B at FP16 does not fit) | ~150+ tok/s |
| 8x MI300X | 1,536 GB | Everything, massive batch sizes | ~120+ tok/s |

A few notes on these numbers:

2x RTX 4090 is a capable consumer multi-GPU setup. The 48GB combined VRAM fits Llama 3.1 70B at Q4 (~42GB), though the headroom left for KV cache is modest, so keep --max-model-len conservative. PCIe bandwidth caps the pair at around 1.4x the throughput of a single card rather than 2x, but this is still a workable production setup for small API deployments. See the 2x RTX 4090 guide for detailed configuration notes.

4x A100 80GB SXM is the sweet spot for research teams and mid-scale inference: enough VRAM to run 405B at Q4, NVLink bandwidth for efficient multi-GPU scaling, and a hardware tier that is widely available in cloud environments.

8x MI300X offers one of the largest single-node VRAM pools available in 2026. At 192GB per card, eight cards provide 1.5TB of combined VRAM, enough to run nearly any model at FP16. AMD's Infinity Fabric scales well across the MI300X configuration. See the MI300X GPU page for more on its architecture.


Troubleshooting Common Issues

"CUDA out of memory" during startup

vLLM profiles KV cache capacity by filling GPU memory during initialization. If this fails, the model weights plus the requested --max-model-len context budget do not fit.

Fixes in order of preference:

  1. Reduce --max-model-len to a smaller context window (e.g., 4096 instead of 32768)
  2. Lower --gpu-memory-utilization from 0.90 to 0.85
  3. Use a quantized checkpoint with --quantization awq
  4. Add another GPU and increase --tensor-parallel-size
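Fixes 1 and 4 can be reasoned about with a rough budget calculation before restarting the server. The sketch below is a hypothetical estimate, not what vLLM computes internally. It ignores activation memory and other runtime overhead, so real limits will be lower. The 327,680 bytes per token is an assumed KV cache cost for a 70B-class model with 2-byte cache entries.

```python
def max_context_tokens(gpu_count: int, gpu_vram_gb: float, utilization: float,
                       weights_gb: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights, within the utilization budget."""
    budget_bytes = gpu_count * gpu_vram_gb * 1e9 * utilization
    free_bytes = budget_bytes - weights_gb * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


KV_BYTES_PER_TOKEN = 327_680  # assumed value for a 70B-class model

# 70B at FP16 (~140 GB of weights) on 80 GB cards at 0.90 utilization.
for gpus in (2, 4):
    tokens = max_context_tokens(gpus, 80, 0.90, 140, KV_BYTES_PER_TOKEN)
    print(f"{gpus} GPUs: room for about {tokens} cached tokens in total")
```

The total is shared by all concurrent sequences. If the result is smaller than --max-model-len, the server cannot hold even one full-length request, which is the situation the startup error reports.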

"NCCL timeout" or NCCL-related hangs

This usually indicates that vLLM cannot find all the GPUs it expects, or that inter-GPU connectivity is broken.

Steps to diagnose:

  1. Run nvidia-smi and confirm all target GPUs are visible and healthy
  2. Check CUDA_VISIBLE_DEVICES — if set, ensure it includes exactly the GPUs you intend to use
  3. For multi-node setups, verify that NCCL can establish connections across nodes (firewall rules, network interface binding)
  4. Run nvidia-smi topo -m to inspect GPU topology and confirm NVLink/PCIe connections are as expected
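Step 2 is easy to automate as a pre-flight check. The sketch below is a hypothetical helper that compares a CUDA_VISIBLE_DEVICES value with the intended tensor-parallel size. It only inspects the variable and does not query the driver, so it complements nvidia-smi rather than replacing it.

```python
from typing import Optional


def visible_gpu_ids(cuda_visible_devices: Optional[str]) -> Optional[list[str]]:
    """Parse the variable; None means it is unset and every GPU is visible."""
    if cuda_visible_devices is None:
        return None
    return [item.strip() for item in cuda_visible_devices.split(",") if item.strip()]


def check_tp_against_visible(tp_size: int, cuda_visible_devices: Optional[str]) -> str:
    """Describe whether the visible GPU list can satisfy tp_size."""
    ids = visible_gpu_ids(cuda_visible_devices)
    if ids is None:
        return "CUDA_VISIBLE_DEVICES is unset; all GPUs are visible"
    if len(ids) < tp_size:
        return f"only {len(ids)} GPU(s) visible, but tensor-parallel size is {tp_size}"
    if len(ids) > tp_size:
        return f"{len(ids)} GPUs visible, but only {tp_size} will be used"
    return "ok"


print(check_tp_against_visible(2, "0,1"))
print(check_tp_against_visible(4, "0,1"))
```

In a launch script, pass os.environ.get("CUDA_VISIBLE_DEVICES") as the second argument and stop before starting the server if the result is not "ok".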

Slow first request

Largely expected behavior. vLLM does its memory profiling and KV cache sizing during startup, before the server accepts traffic, but the first few requests can still be slower while kernels and caches warm up. Subsequent requests are fast. If your deployment has a cold start problem, send a warmup request after startup before routing real traffic.
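A warmup step usually has two parts: wait until the server answers, then send one throwaway request. The sketch below covers the waiting part with a generic polling loop. The function names are hypothetical, and health_probe assumes the server exposes a /health route on http://localhost:8000.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay_s: float = 5.0) -> bool:
    """Call probe until it returns True; connection errors count as not ready."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            pass
        time.sleep(delay_s)
    return False


def health_probe(base_url: str = "http://localhost:8000") -> bool:
    """True when the assumed /health route answers with HTTP 200."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=2) as response:
        return response.status == 200
```

Call wait_until_ready(health_probe) in your deployment script, send a single short completion request once it returns True, and only then add the instance to the load balancer.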

Uneven GPU utilization

If one GPU consistently shows higher utilization than others, verify that --tensor-parallel-size exactly matches the number of GPUs you intend to use. Running with --tensor-parallel-size 3 on a 4-GPU machine, for example, leaves one GPU idle, and on most models it fails at startup anyway because the attention head count is not divisible by 3.

Check per-GPU utilization with:

watch -n 1 nvidia-smi

All participating GPUs should show similar memory utilization and similar compute utilization during active inference.
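The same check can be scripted by asking nvidia-smi for machine-readable output. The sketch below parses the CSV produced by the query flags shown in QUERY. The helper names are hypothetical, and what counts as too large a spread is left to you.

```python
import subprocess

# Query flags for per-GPU index, compute utilization (%), and memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Turn 'index, util, mem' lines into dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": float(util), "mem_mib": float(mem)})
    return stats


def utilization_spread(stats: list[dict]) -> float:
    """Gap between the busiest and the least busy GPU, in percentage points."""
    utils = [gpu["util_pct"] for gpu in stats]
    return max(utils) - min(utils)


def read_gpu_stats() -> list[dict]:
    """Run nvidia-smi and parse its output; needs an NVIDIA driver installed."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling utilization_spread(read_gpu_stats()) during active inference gives a single number to alert on. A persistently large spread suggests one GPU is doing more work than the others.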


vLLM vs Alternatives for Multi-GPU

Choosing an inference engine for multi-GPU deployment involves tradeoffs across format support, operational complexity, and serving features.

| Feature | vLLM | llama.cpp | TGI | ExLlamaV2 |
| --- | --- | --- | --- | --- |
| Tensor parallelism | Native (-tp N) | Manual (--tensor-split) | Native (--num-shard) | Manual (--gpu-split) |
| Continuous batching | Yes | Basic (server mode) | Yes | No |
| GGUF support | Limited | Full | No | No |
| AWQ/GPTQ support | Full | Partial | Full | EXL2 only |
| OpenAI-compatible API | Yes | Yes (server mode) | Yes | Yes |
| Best for | API serving | Local/single-user | Production (HF stack) | EXL2 quants |
| Multi-GPU ease | Excellent | Manual tuning required | Good | Manual tuning required |

vLLM wins for multi-user API serving at any scale. The combination of PagedAttention and continuous batching makes it dramatically more efficient than static batching approaches when concurrency exceeds a handful of users. The native tensor parallelism with a single flag is a genuine operational advantage over tools that require manual shard configuration.

llama.cpp is better for local single-user inference, GGUF format support, and heterogeneous GPU setups where you have mismatched cards. Its --tensor-split flag lets you set what proportion of the model each GPU holds, which matters when GPU VRAM sizes differ. For developers running models locally on workstations with consumer hardware, llama.cpp often edges out vLLM in single-stream latency due to lower overhead.

TGI (Text Generation Inference) is Hugging Face's production serving stack. It has comparable throughput to vLLM in most benchmarks. The choice between vLLM and TGI usually comes down to ecosystem: if your team is already on the Hugging Face stack and uses Inference Endpoints, TGI is natural. For greenfield deployments, vLLM's broader model support and simpler operator experience tip the balance.

ExLlamaV2 excels at EXL2 quantization — a format with better perplexity-per-bit than GPTQ or AWQ at equivalent model sizes. For users who want the best quality-per-VRAM on consumer hardware and are willing to trade serving efficiency for output quality, ExLlamaV2 is worth evaluating. Its multi-GPU support requires manual --gpu-split configuration rather than a single flag.

For most teams deploying production LLM APIs across multiple GPUs, vLLM is the right default. The tooling has matured, the documentation is thorough, and the OpenAI-compatible API makes it easy to slot into existing infrastructure.


Summary

Multi-GPU vLLM with tensor parallelism comes down to three things: matching --tensor-parallel-size to your GPU count, understanding the NVLink vs PCIe performance ceiling, and tuning --max-model-len and --gpu-memory-utilization to fit your model into available VRAM.

The core setup is deliberately simple: one flag enables the entire feature. The complexity is in the hardware. A pair of PCIe-connected consumer GPUs delivers roughly 1.4x the throughput of a single card rather than 2x, while a pair of NVLink datacenter GPUs reaches about 1.84x. Planning around those real numbers prevents surprises in production.

For context on the broader multi-GPU inference landscape and per-framework scaling comparisons, see the multi-GPU LLM inference guide. For help checking whether a specific model fits your specific hardware, the VRAM calculator runs the same scoring algorithm used by this guide.

Frequently Asked Questions

How many GPUs does vLLM support?

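vLLM itself does not impose a fixed GPU limit; the practical ceiling is the hardware. Most servers hold up to 8 GPUs per node, which tensor parallelism can span, and pipeline parallelism extends a deployment across multiple nodes. How far it is worth scaling depends on your interconnect bandwidth: NVLink setups scale efficiently to 8 GPUs, while PCIe setups see diminishing returns beyond 4 GPUs.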

Can I use different GPU models with vLLM tensor parallelism?

Not reliably. Tensor parallelism splits every layer evenly across the GPUs, so a mixed set is held back by its smallest and slowest card; in practice all GPUs should have the same VRAM and compute capability. If you have mismatched cards, llama.cpp with --tensor-split is a more flexible alternative.

Does vLLM work with RTX 4090 for multi-GPU?

Yes, set --tensor-parallel-size 2 with two RTX 4090s. Scaling efficiency is roughly 0.70x per card because the cards communicate over PCIe rather than NVLink. You get roughly 1.4x the throughput of a single card, and 48GB combined VRAM unlocks 70B models at Q4 quantization.

What's the minimum GPU count for Llama 405B with vLLM?

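At Q4 quantization (~230GB), you need at least 4x H100 80GB (320GB) or 2x H200 141GB (282GB, with little room left for KV cache). At FP16 (~810GB), 8x H100 80GB (640GB) is not enough; you need a larger pool such as 8x H200 141GB, 8x MI300X, or 16x H100 across two nodes. Use the --quantization awq flag with a pre-quantized checkpoint to fit on fewer GPUs.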

Is vLLM faster than llama.cpp for multi-GPU?

For multi-user API serving, yes — vLLM's continuous batching dramatically improves throughput under concurrent load. For single-user inference, llama.cpp can be comparable or faster due to lower overhead. The gap widens significantly at batch sizes above 4.