Multi-GPU LLM Inference Guide — NVLink vs PCIe, Tensor Parallelism (2026)
Multi-GPU LLM inference: real NVLink vs PCIe scaling numbers, tensor parallelism sizing, 2×/4×/8× GPU configs for Llama 3.3 70B and Qwen 3.5 122B, and exact tokens-per-second across consumer and datacenter hardware.
Multi-GPU inference is one of those topics where the gap between marketing claims and real-world performance is wide. "Double the GPUs, double the performance" sounds intuitive. In practice, the inter-GPU communication overhead, the interconnect technology, and the way tensor parallelism works all chip away at that theoretical 2x. Understanding the actual scaling numbers — not the theoretical ones — is what separates a well-planned multi-GPU setup from an expensive disappointment.
This guide covers the fundamentals: why multi-GPU exists, how NVLink and PCIe scaling factors differ, which configurations unlock which models, and how to think about cost versus capability. All scaling numbers are sourced from the fit engine powering WillItRunAI's VRAM calculator.
Why Multi-GPU?
The simple answer is VRAM. Some models are too large to fit on any single GPU, and no amount of clever quantization changes that.
Look at the numbers. Llama 3.1 405B at Q4 quantization requires approximately 230GB of VRAM. DeepSeek R1 at Q4 needs around 380GB. At full precision, which matters for research and high-accuracy deployments, Llama 405B needs roughly 810GB at FP16 and DeepSeek R1 needs close to 720GB at its native FP8 (a 16-bit copy of its 671B parameters would be roughly twice that).
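The figures above can be approximated from parameter count alone. The sketch below is a rough estimate for weights only, not the fit engine's method: `weight_memory_gb` is a hypothetical helper, and the 4.5 bits per weight used for Q4 is an assumption (Q4-style formats carry some per-block overhead beyond 4 bits). KV cache and activations come on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    # billions of parameters * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8


# Llama 3.1 405B at FP16 (16 bits per weight)
print(round(weight_memory_gb(405, 16)))   # 810
# Same model at an assumed ~4.5 effective bits per weight for a Q4-style format
print(round(weight_memory_gb(405, 4.5)))  # 228
```

Both results land close to the ~810GB and ~230GB figures quoted above.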
Single GPUs, even the largest datacenter cards available in 2026, max out at 192–288GB per card. NVIDIA's B200 provides 180GB, the B100 provides 192GB, and AMD's MI350X tops out at 288GB. For anything above those thresholds, you are either quantizing aggressively or going multi-GPU.
Multi-GPU inference is not just a workaround for memory constraints. For high-throughput production serving, parallelizing across GPUs can also increase the aggregate tokens-per-second for batch workloads. But the primary driver for most practitioners is simple: the model does not fit on one card.
NVLink vs PCIe: Real Scaling Differences
This is where multi-GPU setups live or die. The interconnect between your GPUs determines how much of the theoretical multi-GPU throughput you actually capture.
Why interconnect matters
Tensor parallelism works by splitting model weight matrices across GPUs. Each card holds a shard of every layer. During inference, each GPU processes its shard in parallel, then an all-reduce operation synchronizes the partial results before the next layer runs. This synchronization happens in every layer of every forward pass, which during generation means for every token produced.
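To make the shard-then-all-reduce pattern concrete, here is a minimal pure-Python sketch of one row-sharded matrix-vector product. It simulates the GPUs with a loop and is illustrative only: the function names are hypothetical, and real engines run the shards concurrently and use a collective library for the all-reduce.

```python
def matvec(x, w):
    """Multiply vector x (length k) by matrix w (k rows, n columns)."""
    n_cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_cols)]


def all_reduce_sum(partials):
    """Element-wise sum of every GPU's partial output (what all-reduce computes)."""
    return [sum(vals) for vals in zip(*partials)]


def row_parallel_matvec(x, w, n_gpus):
    """Shard w by rows across n_gpus, compute partial products, then all-reduce."""
    k = len(x)
    assert k % n_gpus == 0, "input dimension must divide evenly across GPUs"
    shard = k // n_gpus
    partials = []
    for g in range(n_gpus):
        lo, hi = g * shard, (g + 1) * shard
        # Each simulated GPU sees only its slice of the input and its rows of w.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials)


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
# The sharded result matches the single-device result.
assert row_parallel_matvec(x, w, n_gpus=2) == matvec(x, w)
```

The `all_reduce_sum` step is the part that crosses the interconnect, which is why its cost shows up in every layer.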
The bandwidth available for that synchronization directly determines the overhead. High bandwidth means fast synchronization and minimal stall time. Low bandwidth means the GPUs are frequently waiting for partial results to arrive before they can proceed.
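A back-of-the-envelope way to see the bandwidth effect is the standard ring all-reduce cost model, in which each GPU moves 2(N-1)/N of the message. The sketch below is a bandwidth-only estimate under stated assumptions: it ignores per-message latency (which dominates for small messages), treats the quoted link figure as fully usable, and uses a hypothetical 64 MB message.

```python
def ring_allreduce_seconds(message_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate for one ring all-reduce; ignores link latency."""
    if n_gpus < 2:
        return 0.0
    # In a ring all-reduce each GPU sends and receives 2*(N-1)/N of the message.
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)


message = 64e6  # hypothetical 64 MB of partial results
fast = ring_allreduce_seconds(message, n_gpus=2, link_gb_per_s=900)
slow = ring_allreduce_seconds(message, n_gpus=2, link_gb_per_s=64)
print(f"900 GB/s link: {fast * 1e6:.0f} us, 64 GB/s link: {slow * 1e6:.0f} us")
```

Under this model the transfer time scales inversely with link bandwidth, so the slower link stalls each synchronization proportionally longer.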
The numbers
Our fit engine uses empirically-derived scaling factors for common interconnect configurations:
- NVLink (H100/H200 SXM): 0.92x, 8% overhead. NVLink 4.0 provides 900 GB/s bidirectional bandwidth per GPU in an 8-GPU NVLink domain. This is fast enough that communication overhead is a minor tax.
- NVLink (B200/B100/GB200): 0.93x — Slightly better at 7% overhead. NVLink 5.0 pushes to 1800 GB/s, reducing synchronization stalls further.
- AMD Infinity Fabric (MI300X, MI325X): 0.88x — 12% overhead. Infinity Fabric at 896 GB/s is slower than NVLink but still a high-speed dedicated interconnect, substantially better than PCIe.
- PCIe (no NVLink): 0.75x — 25% overhead. PCIe Gen5 x16 delivers approximately 64 GB/s, roughly 14x less bandwidth than NVLink 4.0. This gap makes tensor parallelism noticeably more expensive.
- NVLink (A100 SXM): 0.90x — NVLink 3.0 at 600 GB/s. Fast but the older generation shows slightly higher overhead than current NVLink 4.0.
What this means in practice
Two H100 80GB SXM cards connected via NVLink provide 0.92 × 160GB = ~147GB of effective VRAM, not 160GB. The 8% discount stands in for the cycles spent on all-reduce synchronization.
Two L40S 48GB PCIe cards provide 0.75 × 96GB = 72GB of effective VRAM. You are paying for 96GB of hardware but getting the effective capacity of 72GB, and at slower per-token latency than the NVLink configuration.
The comparison makes the NVLink premium more defensible. When evaluating multi-GPU setups, always calculate the effective VRAM using these scaling factors, not the raw sum.
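The arithmetic above can be written as a small lookup. This is a minimal sketch using the scaling factors quoted in this guide; the dictionary keys and the function name are hypothetical labels, and a single card is assumed to carry no scaling penalty.

```python
# Scaling factors quoted in this guide, keyed by hypothetical labels.
SCALING = {
    "nvlink5": 0.93,          # B200 / B100 / GB200
    "nvlink4": 0.92,          # H100 / H200 SXM
    "nvlink3": 0.90,          # A100 SXM
    "infinity_fabric": 0.88,  # MI300X / MI325X
    "pcie": 0.75,             # no NVLink
}


def effective_vram_gb(vram_per_card_gb: float, n_cards: int, interconnect: str) -> float:
    """Raw VRAM times the interconnect scaling factor; a single card is not scaled."""
    raw = vram_per_card_gb * n_cards
    if n_cards == 1:
        return raw
    return raw * SCALING[interconnect]


print(round(effective_vram_gb(80, 2, "nvlink4"), 1))  # 2x H100 80GB SXM
print(round(effective_vram_gb(48, 2, "pcie"), 1))     # 2x L40S 48GB PCIe
```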
Practical Configurations
The table below shows real effective VRAM for common configurations alongside representative models each unlocks. Effective VRAM is calculated as raw VRAM × scaling factor.
| Configuration | Raw VRAM | Scaling Factor | Effective VRAM | Key Models Unlocked |
|---|---|---|---|---|
| 2x H100 80GB SXM | 160 GB | 0.92x | ~147 GB | Mixtral 8x22B, 70B models at high quant |
| 4x A100 80GB SXM | 320 GB | 0.90x | ~288 GB | Llama 405B (Q4), all 70B at high quant |
| 8x MI300X 192GB | 1,536 GB | 0.88x | ~1,352 GB | Any model at any quantization |
| 2x L40S 48GB PCIe | 96 GB | 0.75x | ~72 GB | Most 70B models at Q4 |
| 8x H100 80GB SXM | 640 GB | 0.92x | ~589 GB | DeepSeek R1 (Q4), Llama 405B (Q4) |
A few things stand out from this data.
First, 2x H100 80GB SXM lands at ~147GB effective, which is well short of what Llama 3.1 405B needs at Q4 (~230GB required). That configuration does not fit the model: you need to quantize more aggressively, add more cards, or use a card with more VRAM per device like the H200.
Second, the MI300X at 192GB per card is one of the most VRAM-dense options available. An 8-GPU MI300X cluster provides over 1.3TB of effective VRAM — enough for any current model at FP16 precision with room to spare.
Third, the L40S PCIe configuration's 0.75x scaling factor means two cards are giving you less effective VRAM than a single H100 80GB SXM. The L40S pair costs less, but the value proposition depends heavily on whether the model you need fits in 72GB effective and whether latency matters to you.
Tensor Parallelism in Software
All major inference engines support multi-GPU inference. Software is rarely the bottleneck — the hardware interconnect and configuration management are where the real complexity lies.
vLLM
vLLM has the most mature tensor parallelism implementation for production serving. Launch with:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-405B-Instruct \
--tensor-parallel-size 4
vLLM handles tensor split automatically across all available GPUs specified by CUDA_VISIBLE_DEVICES. Pipeline parallelism (--pipeline-parallel-size) is also available for very large models where tensor parallelism alone is insufficient.
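When launches are scripted, it can help to check the GPU count before starting the server. The sketch below builds the same command line shown above and relies on the assumption that vLLM needs tensor-parallel size × pipeline-parallel size GPUs in total. `vllm_server_command` is a hypothetical helper, not part of vLLM, and it only uses the flags already mentioned in this section.

```python
from typing import List, Optional


def vllm_server_command(model: str, tensor_parallel: int,
                        pipeline_parallel: int = 1,
                        visible_gpus: Optional[int] = None) -> List[str]:
    """Build the launch command; hypothetical helper, not part of vLLM."""
    needed = tensor_parallel * pipeline_parallel
    if visible_gpus is not None and needed > visible_gpus:
        raise ValueError(f"need {needed} GPUs but only {visible_gpus} are visible")
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel),
    ]
    if pipeline_parallel > 1:
        cmd += ["--pipeline-parallel-size", str(pipeline_parallel)]
    return cmd


print(" ".join(vllm_server_command("meta-llama/Llama-3.1-405B-Instruct", 4, visible_gpus=4)))
```

The returned list can be passed to `subprocess.run` as-is.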
Text Generation Inference (TGI)
TGI uses --num-shard N to specify the number of GPU shards:
text-generation-launcher \
--model-id meta-llama/Llama-3.1-405B-Instruct \
--num-shard 4
TGI is optimized for serving and includes continuous batching, which helps amortize the multi-GPU communication overhead across larger batch sizes.
llama.cpp
llama.cpp supports heterogeneous GPU setups via --tensor-split:
llama-cli \
--model llama-405b.Q4_K_M.gguf \
--tensor-split 1,1,1,1 \
-ngl 999
The --tensor-split values define the relative proportion of layers assigned to each GPU. You can manually weight them to account for different VRAM sizes. This is the most flexible option for non-standard configurations but requires manual tuning for optimal performance.
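A reasonable starting point for mixed cards is to weight the split by each card's VRAM. The sketch below is a hypothetical helper (not part of llama.cpp) that turns whole-GB VRAM sizes into a reduced ratio string for --tensor-split. It assumes VRAM-proportional weighting is what you want, which may still need manual tuning afterwards.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb):
    """Turn per-GPU VRAM sizes (whole GB) into a reduced ratio string."""
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


print(tensor_split_arg([80, 80, 80, 80]))  # 1,1,1,1
print(tensor_split_arg([80, 48]))          # 5,3
```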
ExLlamaV2
ExLlamaV2 uses --gpu_split (short form -gs) to specify VRAM allocation across GPUs:
python inference.py \
--model_dir ./llama-405b-exl2 \
--gpu_split 80,80,80,80
Values are in gigabytes per GPU. ExLlamaV2's quantization format (EXL2) is specifically designed for fast inference and handles the tensor split efficiently.
All four engines support multi-GPU inference with straightforward configuration. The choice between them is driven by serving requirements (vLLM and TGI for production APIs), ease of use (llama.cpp for local setups), and quantization format (ExLlamaV2 for EXL2 models).
Cost Analysis
Before committing to a multi-GPU setup, it is worth pricing out the single-large-card alternatives. The comparison is often closer than expected.
| Option | VRAM | Est. Cloud Cost/hr | Scaling | Best For |
|---|---|---|---|---|
| 1x H200 141GB | 141 GB | ~$4.50/hr | N/A | Most 70B models |
| 2x H100 80GB SXM | 160 GB | ~$6.50/hr | 0.92x | 70B+ at high quant |
| 1x B200 180GB | 180 GB | ~$7.00/hr | N/A | Large models, single card |
| 4x A100 80GB | 320 GB | ~$8.00/hr | 0.90x | Very large models |
| 8x H100 80GB SXM | 640 GB | ~$26.00/hr | 0.92x | 400B+ models at Q4, DeepSeek R1 (Q4) |
The H200 141GB single card stands out. At ~$4.50/hr, it covers almost every practical inference workload for models up to Llama 3.1 70B and many 100B+ models at quantized precision. Two H100 80GB SXM cards at ~$6.50/hr give you slightly more raw VRAM (160GB vs 141GB) but add multi-GPU coordination overhead and cost 44% more per hour.
For DeepSeek R1 at Q4 (~380GB), neither single-card option works, and the 4x A100 80GB configuration (~288GB effective) also falls short. Of the options above, only the 8x H100 80GB SXM configuration at ~$26.00/hr holds it. This is the tier where multi-GPU becomes mandatory, not optional.
The 8x H100 configuration at ~$26.00/hr is for the largest models. Its 640GB of raw VRAM works out to ~589GB effective, which holds DeepSeek R1 at Q4 (~380GB) with headroom. It does not hold Llama 3.1 405B at FP16 (~810GB): for research reproducibility, safety evaluations, or deployments where full model fidelity matters, FP16 at that scale needs higher-capacity cards, such as the 8x MI300X configuration (~1,352GB effective).
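One way to read the cost table is to ask for the cheapest option whose effective VRAM covers a given model. The sketch below does that with the estimated prices and scaling factors from the table. `OPTIONS` and `cheapest_fit` are hypothetical names, single cards are given a factor of 1.0, and the prices are the rough estimates above, not quotes.

```python
OPTIONS = [
    # (label, raw VRAM in GB, scaling factor, estimated cloud $/hr)
    ("1x H200 141GB", 141, 1.00, 4.50),
    ("2x H100 80GB SXM", 160, 0.92, 6.50),
    ("1x B200 180GB", 180, 1.00, 7.00),
    ("4x A100 80GB", 320, 0.90, 8.00),
    ("8x H100 80GB SXM", 640, 0.92, 26.00),
]


def cheapest_fit(required_gb):
    """Return the lowest-cost option whose effective VRAM covers required_gb, or None."""
    fits = [(cost, label) for label, raw, scale, cost in OPTIONS
            if raw * scale >= required_gb]
    return min(fits)[1] if fits else None


print(cheapest_fit(230))  # Llama 3.1 405B at Q4
print(cheapest_fit(380))  # DeepSeek R1 at Q4
print(cheapest_fit(810))  # Llama 3.1 405B at FP16
```

A result of `None` means nothing in this table holds the model and a higher-capacity configuration is needed.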
On-premise vs cloud
Cloud pricing makes the comparison easy. On-premise multi-GPU setups have different economics: high upfront cost, lower per-hour cost at sustained utilization, and operational complexity. An 8x H100 SXM server costs $300,000–500,000 to purchase and requires dedicated rack space, three-phase power, and cooling infrastructure.
For most teams, the right framework is: use cloud for development and infrequent large-model inference, evaluate on-premise only when sustained utilization justifies the capital expenditure.
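"Sustained utilization" can be turned into a rough break-even figure. The sketch below is a deliberately simple model under loud assumptions: it compares the purchase price to the cloud hourly rate and, by default, ignores power, cooling, staffing, and depreciation (pass `onprem_usd_per_hr` to include a running cost). The function name is hypothetical.

```python
def break_even_hours(purchase_usd: float, cloud_usd_per_hr: float,
                     onprem_usd_per_hr: float = 0.0) -> float:
    """Hours of use at which buying costs the same as renting."""
    saving_per_hr = cloud_usd_per_hr - onprem_usd_per_hr
    if saving_per_hr <= 0:
        raise ValueError("on-premise running cost must be below the cloud rate")
    return purchase_usd / saving_per_hr


# Low end of the 8x H100 server price range vs the ~$26.00/hr cloud estimate.
hours = break_even_hours(300_000, 26.00)
print(round(hours), "hours, about", round(hours / 730, 1), "months of 24/7 use")
```

At the low end of the price range this comes to roughly 11,500 hours of continuous use before the purchase pays for itself, and longer once operating costs are included.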
Multi-GPU-Capable GPUs from Our Catalog
Not every GPU is a good fit for multi-GPU inference. Consumer cards (RTX series) generally lack NVLink, so multi-GPU setups built from them fall back to PCIe. The table below covers the datacenter and workstation cards from our catalog that are designed for multi-GPU setups.
| GPU | Max GPUs | Interconnect | VRAM per GPU | Scaling Factor |
|---|---|---|---|---|
| B200 180GB | 8 | NVLink (1800 GB/s) | 180 GB | 0.93x |
| B100 192GB | 8 | NVLink (1800 GB/s) | 192 GB | 0.93x |
| GB200 192GB | 2 | NVLink (1800 GB/s) | 192 GB | 0.93x |
| H200 141GB SXM | 8 | NVLink (900 GB/s) | 141 GB | 0.92x |
| H100 80GB SXM | 8 | NVLink (900 GB/s) | 80 GB | 0.92x |
| GH200 96GB | 2 | NVLink (900 GB/s) | 96 GB | 0.92x |
| A100 80GB SXM | 8 | NVLink (600 GB/s) | 80 GB | 0.90x |
| MI350X 288GB | 8 | Infinity Fabric (896 GB/s) | 288 GB | 0.90x |
| MI300X 192GB | 8 | Infinity Fabric (896 GB/s) | 192 GB | 0.88x |
| MI325X 256GB | 8 | Infinity Fabric (896 GB/s) | 256 GB | 0.88x |
| A800 80GB | 8 | NVLink (400 GB/s) | 80 GB | 0.88x |
| H800 80GB | 8 | NVLink (400 GB/s) | 80 GB | 0.88x |
| Gaudi 3 128GB | 8 | PCIe | 128 GB | 0.85x |
| MI250X 128GB | 4 | Infinity Fabric (800 GB/s) | 128 GB | 0.85x |
| A30 24GB | 2 | NVLink (200 GB/s) | 24 GB | 0.85x |
| Max 1550 128GB | 4 | PCIe | 128 GB | 0.80x |
| H100 PCIe 80GB | 4 | PCIe | 80 GB | 0.78x |
| H20 96GB | 8 | PCIe | 96 GB | 0.78x |
| L40S 48GB | 2 | PCIe | 48 GB | 0.75x |
| L40 48GB | 2 | PCIe | 48 GB | 0.75x |
| A40 48GB | 2 | PCIe | 48 GB | 0.75x |
A few observations worth highlighting.
The A800 and H800 are Chinese-market variants of the A100 and H100 with reduced NVLink bandwidth (400 GB/s vs 600/900 GB/s). They achieve 0.88x scaling rather than the 0.90x/0.92x of their Western counterparts — a meaningful difference for large model inference that adds up across many inference calls.
The H20, despite being an H100-generation chip, connects via PCIe rather than NVLink and is therefore capped at 0.78x scaling. Its advantage is high VRAM (96GB) per card at a lower price point, not interconnect quality.
The MI350X at 288GB per card with Infinity Fabric and 0.90x scaling is one of the more interesting configurations for large model inference. Eight cards provide 2,304GB raw / ~2,074GB effective, more than enough for any model at full precision.
For cards supporting more than 2 GPUs, the scaling factors in this table represent the per-GPU efficiency in a fully-populated configuration. Efficiency typically decreases slightly as you add more cards due to the increased synchronization overhead in the all-reduce collective.
Heterogeneous GPU Setups: Mostly Avoid Them
llama.cpp's --tensor-split flag makes it technically possible to run inference across mismatched GPUs. In practice, this creates more problems than it solves.
The core issue is that the slowest card in a tensor-parallel setup determines the latency for every synchronization operation. If you pair an H100 with an A100, the H100 spends time idle waiting for the A100 to complete its shard and contribute to the all-reduce. The fast card's advantage is partly wasted.
There are niche cases where heterogeneous setups make sense: you have an existing server with mixed cards and need to run a model that does not fit on any single card, or you are evaluating whether to upgrade. In these cases, manual --tensor-split tuning to weight allocations toward the faster card can help. But it is an optimization project, not a straightforward deployment.
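The "slowest card sets the pace" effect, and why weighting helps, can be shown with a toy model. In the sketch below each GPU's time per synchronized step is its share of the work divided by its relative speed, and the step finishes when the slowest GPU does. The speeds are hypothetical and the model ignores communication time entirely.

```python
def step_time(work_shares, relative_speeds):
    """Time for one synchronized step: every GPU waits for the slowest shard."""
    return max(share / speed for share, speed in zip(work_shares, relative_speeds))


def proportional_shares(relative_speeds):
    """Split the work in proportion to each GPU's speed."""
    total = sum(relative_speeds)
    return [s / total for s in relative_speeds]


speeds = [1.0, 0.5]  # hypothetical: the second card is half as fast
even = step_time([0.5, 0.5], speeds)
weighted = step_time(proportional_shares(speeds), speeds)
print(f"even split: {even:.2f}, speed-weighted split: {weighted:.2f}")
```

With an even split the fast card idles for half of each step. Weighting the split shortens the step, but only if the larger shard still fits in the faster card's VRAM.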
For any greenfield infrastructure decision, buy identical cards.
When to Choose Multi-GPU
Multi-GPU is the right answer in specific, well-defined situations. It is not a universal upgrade.
Go multi-GPU when:
- The model you need does not fit on any single available GPU at your required precision. This is the clearest case. Llama 3.1 405B at Q4 (~230GB) exceeds every single card covered here except the 288GB MI350X, so on NVIDIA hardware it requires multiple cards by definition.
- You are running DeepSeek R1 at Q4 (~380GB) or at its native FP8 precision (~720GB). Same logic: the model mandates multi-GPU.
- You are serving high-throughput batch inference and need aggregate throughput beyond what a single card delivers. Tensor parallelism helps less here than pipeline parallelism or simply running multiple single-card instances, but sometimes the constraint is the model size.
Stay single-GPU when:
- The model fits on a single large-VRAM card with acceptable quantization. A single H200 141GB covers a vast range of use cases including all 70B models at Q4-Q8.
- You are optimizing for latency. Single-GPU inference has lower per-token latency than tensor-parallel multi-GPU because there is no inter-GPU synchronization cost.
- You are doing local development or experimentation. Multi-GPU setups add setup complexity and cost that is not justified for exploratory work.
- Cost is a constraint. A single B200 180GB card at ~$7.00/hr covers more use cases than most people need, with less complexity than any multi-GPU arrangement.
The rule of thumb: reach for multi-GPU when the model does not fit, not before.
Estimating Effective VRAM for Your Configuration
Before committing to any multi-GPU setup, use this formula to estimate whether your target model will fit:
Effective VRAM = (VRAM per card × number of cards) × scaling factor
Then compare against the model's VRAM requirement at your target quantization. Our VRAM calculator does this automatically for any hardware and model combination in our catalog.
Example: You want to run DeepSeek R1 at Q4 on 4x A100 80GB SXM.
Effective VRAM = (80 GB × 4) × 0.90 = 288 GB
DeepSeek R1 Q4 requirement ≈ 380 GB
It does not fit. You need either more cards (6x A100 provides 432GB effective) or a different approach (more aggressive quantization, or switching to MI300X cards with their higher per-card VRAM).
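The same formula can be inverted to find the smallest card count that fits. The sketch below is a minimal search under the guide's assumptions (a fixed scaling factor regardless of card count, no penalty for a single card). `min_cards` is a hypothetical helper, and some inference engines restrict which GPU counts are usable for tensor parallelism, so treat the result as a lower bound.

```python
def min_cards(required_gb, vram_per_card_gb, scaling, max_cards=8):
    """Smallest card count whose effective VRAM covers required_gb, or None."""
    for n in range(1, max_cards + 1):
        factor = 1.0 if n == 1 else scaling
        if vram_per_card_gb * n * factor >= required_gb:
            return n
    return None


print(min_cards(380, 80, 0.90))   # DeepSeek R1 Q4 on A100 80GB SXM
print(min_cards(380, 192, 0.88))  # DeepSeek R1 Q4 on MI300X 192GB
```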
This kind of calculation prevents expensive over-commitment. Run the numbers before provisioning hardware.
Summary
Multi-GPU LLM inference is a capability multiplier when you need it and an expensive complication when you do not.
The key variables:
- Interconnect quality determines scaling efficiency. NVLink SXM configs achieve 0.90–0.93x scaling. PCIe configs drop to 0.75–0.78x. That gap of 12–18 percentage points is real money and real latency.
- Calculate effective VRAM, not raw VRAM. A 2x L40S 48GB PCIe setup gives you 96GB raw but only 72GB effective. Model fit decisions based on raw VRAM will lead to failures at runtime.
- Single large-VRAM cards often beat multi-GPU on cost and latency. For workloads that fit on an H200 141GB or MI350X 288GB, those single cards are better on most metrics than multi-GPU alternatives at comparable cost.
- All major inference engines support multi-GPU. vLLM, TGI, llama.cpp, and ExLlamaV2 support splitting a model across GPUs with straightforward configuration. Software is not the constraint.
- Multi-GPU is mandatory above certain model sizes. Llama 3.1 405B at FP16 (~810GB) and DeepSeek R1 at Q4 (~380GB) both exceed the largest single card covered here (288GB), so there is no single-card alternative at any price.
Use our tools to verify any specific configuration:
- Check if your hardware runs a specific model — includes multi-GPU configurations
- Llama 405B compatibility check — see which GPU setups can run it
- DeepSeek R1 compatibility check — full hardware compatibility breakdown
- DeepSeek R1 model page — quantization options and VRAM requirements
- Llama 3.1 405B model page — full model profile and requirements