Will It Run AI

Multi-GPU LLM Inference Guide — NVLink vs PCIe, Tensor Parallelism (2026)

Multi-GPU LLM inference: real NVLink vs PCIe scaling numbers, tensor parallelism sizing, 2×/4×/8× GPU configs for Llama 3.3 70B and Qwen 3.5 122B, and exact tokens-per-second across consumer and datacenter hardware.

Multi-GPU inference is one of those topics where the gap between marketing claims and real-world performance is wide. "Double the GPUs, double the performance" sounds intuitive. In practice, the inter-GPU communication overhead, the interconnect technology, and the way tensor parallelism works all chip away at that theoretical 2x. Understanding the actual scaling numbers — not the theoretical ones — is what separates a well-planned multi-GPU setup from an expensive disappointment.

This guide covers the fundamentals: why multi-GPU exists, how NVLink and PCIe scaling factors differ, which configurations unlock which models, and how to think about cost versus capability. All scaling numbers are sourced from the fit engine powering WillItRunAI's VRAM calculator.


Why Multi-GPU?

The simple answer is VRAM. Some models are too large to fit on any single GPU, and no amount of clever quantization changes that.

Look at the numbers. Llama 3.1 405B at Q4 quantization requires approximately 230GB of VRAM. DeepSeek R1 at Q4 needs around 380GB. At FP16 precision, which matters for research and high-accuracy deployments, Llama 405B needs roughly 810GB. DeepSeek R1 needs close to 720GB, a figure that matches its native FP8 weights rather than a true FP16 copy.

Single GPUs, even the largest datacenter cards available in 2026, max out at 180–288GB per card. NVIDIA's B200 provides 180GB, the B100 provides 192GB, and AMD's MI350X tops out at 288GB. For anything above those thresholds, you are either quantizing aggressively or going multi-GPU.
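
As a rough cross-check on figures like these, weight memory can be approximated as parameter count times bits per weight. The sketch below is illustrative only: the function names are hypothetical, the card list simply repeats the capacities quoted above, and the 4.5 bits-per-weight value for Q4 is an assumption. It ignores KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
# ASSUMPTION: ~4.5 bits per weight for Q4-style formats (quantized values plus scales).

SINGLE_CARD_GB = {"B200": 180, "B100": 192, "MI350X": 288}  # capacities quoted above


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def cards_that_fit(required_gb: float) -> list[str]:
    """Names of single cards whose raw VRAM covers the requirement."""
    return [name for name, gb in SINGLE_CARD_GB.items() if gb >= required_gb]


if __name__ == "__main__":
    fp16 = weights_gb(405, 16)   # Llama 3.1 405B at FP16
    q4 = weights_gb(405, 4.5)    # same model at the assumed 4.5 bits per weight
    print(f"FP16: {fp16:.0f} GB, fits on: {cards_that_fit(fp16)}")
    print(f"Q4:   {q4:.0f} GB, fits on: {cards_that_fit(q4)}")
```

Treat the result as a lower bound: a card that only just covers the weights leaves little room for context.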

Multi-GPU inference is not just a workaround for memory constraints. For high-throughput production serving, parallelizing across GPUs can also increase the aggregate tokens-per-second for batch workloads. But the primary driver for most practitioners is simple: the model does not fit on one card.


NVLink vs PCIe: Real Scaling Differences

This is where multi-GPU setups live or die. The interconnect between your GPUs determines how much of the theoretical multi-GPU throughput you actually capture.

Why interconnect matters

Tensor parallelism works by splitting model weight matrices across GPUs. Each card holds a shard of every layer. During inference, each GPU processes its shard in parallel, then an all-reduce operation synchronizes the partial results before the next layer runs. This synchronization happens on every single forward pass — for every token generated.
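
The shard-then-synchronize pattern described above can be illustrated in plain Python on a toy matrix. This is a sketch of one common scheme (splitting a weight matrix along its input dimension so each device produces a partial sum). The helper names are hypothetical, and real engines do this on GPU tensors through a collective-communication library.

```python
# Toy illustration of tensor parallelism for one linear layer: y = W @ x.
# Each "GPU" holds a slice of W's columns and the matching slice of x,
# computes a partial output, and an all-reduce (sum) combines the partials.


def matvec(W, x):
    """Plain matrix-vector product; W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def shard_by_input_dim(W, x, n_gpus):
    """Split W column-wise and x element-wise into n_gpus contiguous shards."""
    size = len(x)
    bounds = [size * g // n_gpus for g in range(n_gpus + 1)]
    return [
        ([row[lo:hi] for row in W], x[lo:hi])
        for lo, hi in zip(bounds, bounds[1:])
    ]


def all_reduce_sum(partials):
    """Element-wise sum of the per-GPU partial outputs."""
    return [sum(values) for values in zip(*partials)]


def tensor_parallel_matvec(W, x, n_gpus):
    """Compute W @ x by sharding, computing partials, and all-reducing."""
    shards = shard_by_input_dim(W, x, n_gpus)
    partials = [matvec(W_shard, x_shard) for W_shard, x_shard in shards]
    return all_reduce_sum(partials)


if __name__ == "__main__":
    W = [[1, 2, 3, 4], [5, 6, 7, 8]]
    x = [1, 1, 2, 2]
    print(matvec(W, x))                     # single-device result
    print(tensor_parallel_matvec(W, x, 2))  # same values after the all-reduce
```

The all-reduce is the step that crosses the interconnect, which is why it recurs in every layer of every forward pass.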

The bandwidth available for that synchronization directly determines the overhead. High bandwidth means fast synchronization and minimal stall time. Low bandwidth means the GPUs are frequently waiting for partial results to arrive before they can proceed.
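
A back-of-the-envelope model makes the bandwidth dependence concrete. The sketch below assumes a ring all-reduce, two all-reduces per transformer layer, and FP16 activations. These are common simplifications, not measurements, and the model ignores per-message latency, which often dominates for the small messages exchanged during decoding. The function names and the example shape (hidden size 8192, 80 layers, roughly a 70B-class model) are illustrative assumptions.

```python
# Bandwidth-only estimate of per-token synchronization time under tensor parallelism.
# ASSUMPTIONS: ring all-reduce, two all-reduces per layer, 2-byte activations,
# batch size 1, and no fixed per-message latency.


def ring_allreduce_bytes_per_gpu(message_bytes: float, n_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce of one message."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes


def sync_seconds_per_token(hidden_size, n_layers, n_gpus, link_gb_per_s,
                           bytes_per_value=2, allreduces_per_layer=2):
    """Time spent moving activations for one generated token."""
    message = hidden_size * bytes_per_value
    per_gpu = ring_allreduce_bytes_per_gpu(message, n_gpus)
    total_bytes = n_layers * allreduces_per_layer * per_gpu
    return total_bytes / (link_gb_per_s * 1e9)


if __name__ == "__main__":
    # ASSUMPTION: per-direction bandwidths of ~450 GB/s (NVLink 4.0, half of the
    # 900 GB/s bidirectional figure) and ~64 GB/s (PCIe Gen5 x16).
    for name, bandwidth in [("NVLink 4.0", 450), ("PCIe Gen5 x16", 64)]:
        seconds = sync_seconds_per_token(8192, 80, 2, bandwidth)
        print(f"{name}: {seconds * 1e6:.1f} microseconds per token")
```

The absolute numbers are small at batch size 1. The gap widens as batch size and GPU count grow, because the message size scales with the number of tokens processed per step.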

The numbers

Our fit engine uses empirically-derived scaling factors for common interconnect configurations:

  • NVLink (H100/H200 SXM): 0.92x, 8% overhead. NVLink 4.0 provides 900 GB/s bidirectional bandwidth per GPU in an 8-GPU NVLink domain. This is fast enough that communication overhead is a minor tax.
  • NVLink (B200/B100/GB200): 0.93x, slightly better at 7% overhead. NVLink 5.0 pushes to 1800 GB/s, reducing synchronization stalls further.
  • NVLink (A100 SXM): 0.90x, 10% overhead. NVLink 3.0 at 600 GB/s is fast, but the older generation shows slightly higher overhead than NVLink 4.0.
  • AMD Infinity Fabric (MI300X, MI325X): 0.88x, 12% overhead. Infinity Fabric at 896 GB/s is slower than NVLink but still a high-speed dedicated interconnect, substantially better than PCIe.
  • PCIe (no NVLink): 0.75x, 25% overhead. PCIe Gen5 x16 delivers approximately 64 GB/s per direction (about 128 GB/s bidirectional), roughly 7x less than NVLink 4.0's 900 GB/s bidirectional figure. This gap makes tensor parallelism noticeably more expensive.

What this means in practice

In the fit engine's accounting, two H100 80GB SXM cards connected via NVLink provide 0.92 × 160GB = ~147GB of effective VRAM, not 160GB. The 8% discount reflects the cycles spent on all-reduce synchronization rather than useful work.

Two L40S 48GB PCIe cards provide 0.75 × 96GB = 72GB of effective VRAM. You are paying for 96GB worth of hardware but getting the effective utilization of 72GB, and at slower per-token latency than the NVLink configuration.

The comparison makes the NVLink premium more defensible. When evaluating multi-GPU setups, always calculate the effective VRAM using these scaling factors, not the raw sum.
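
The effective-VRAM arithmetic can be written as a small helper. This is a minimal sketch: the dictionary keys and function name are hypothetical labels, and the factors are the ones listed in this section. A single card is returned at full capacity, on the assumption that no inter-GPU synchronization applies.

```python
# Effective VRAM = raw VRAM x interconnect scaling factor.
# Factors are the ones quoted in this guide; the key names are illustrative.

SCALING_FACTORS = {
    "nvlink5": 0.93,          # B200 / B100 / GB200
    "nvlink4": 0.92,          # H100 / H200 SXM
    "nvlink3": 0.90,          # A100 SXM
    "infinity_fabric": 0.88,  # MI300X / MI325X
    "pcie": 0.75,             # no NVLink
}


def effective_vram_gb(vram_per_card_gb: float, n_cards: int, interconnect: str) -> float:
    """Raw VRAM discounted by the interconnect's scaling factor."""
    if n_cards < 1:
        raise ValueError("n_cards must be at least 1")
    if n_cards == 1:
        return float(vram_per_card_gb)  # single card: no multi-GPU discount
    return vram_per_card_gb * n_cards * SCALING_FACTORS[interconnect]


if __name__ == "__main__":
    print(round(effective_vram_gb(80, 2, "nvlink4"), 1))  # 2x H100 80GB SXM
    print(round(effective_vram_gb(48, 2, "pcie"), 1))     # 2x L40S 48GB PCIe
```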


Practical Configurations

The table below shows real effective VRAM for common configurations alongside representative models each unlocks. Effective VRAM is calculated as raw VRAM × scaling factor.

| Configuration | Raw VRAM | Scaling Factor | Effective VRAM | Key Models Unlocked |
|---|---|---|---|---|
| 2x H100 80GB SXM | 160 GB | 0.92x | ~147 GB | Mixtral 8x22B, 70B+ at high quant |
| 4x A100 80GB SXM | 320 GB | 0.90x | ~288 GB | Llama 405B (Q4), all 70B at high quant |
| 8x MI300X 192GB | 1,536 GB | 0.88x | ~1,352 GB | Any model at any quantization |
| 2x L40S 48GB PCIe | 96 GB | 0.75x | ~72 GB | Most 70B models at Q4 |
| 8x H100 80GB SXM | 640 GB | 0.92x | ~589 GB | DeepSeek R1 (Q4), Llama 405B (Q4) |

A few things stand out from this data.

First, 2x H100 80GB SXM lands at ~147GB effective, well short of the ~230GB that Llama 3.1 405B needs at Q4. Even the 160GB raw total is not enough, so that pairing cannot hold the model: you need to either quantize more aggressively, add more cards, or use a card with more VRAM per device like the H200.

Second, the MI300X at 192GB per card is one of the most VRAM-dense options available. An 8-GPU MI300X cluster provides over 1.3TB of effective VRAM — enough for any current model at FP16 precision with room to spare.

Third, the L40S PCIe configuration's 0.75x scaling factor means two cards are giving you less effective VRAM than a single H100 80GB SXM. The L40S pair costs less, but the value proposition depends heavily on whether the model you need fits in 72GB effective and whether latency matters to you.


Tensor Parallelism in Software

All major inference engines support multi-GPU inference. Software is rarely the bottleneck — the hardware interconnect and configuration management are where the real complexity lies.

vLLM

vLLM has the most mature tensor parallelism implementation for production serving. Launch with:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 4

vLLM shards the weights automatically across the GPUs exposed through CUDA_VISIBLE_DEVICES, so --tensor-parallel-size should not exceed the number of visible devices. Pipeline parallelism (--pipeline-parallel-size) is also available for very large models where tensor parallelism alone is insufficient.
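
When launches are scripted, it can help to derive the tensor-parallel size from the GPU list so the two never drift apart. The sketch below only assembles the command shown above and a CUDA_VISIBLE_DEVICES value. The helper names are hypothetical, and nothing is started unless you uncomment the subprocess call.

```python
import os
import shlex


def vllm_server_command(model: str, tensor_parallel_size: int) -> list[str]:
    """Build the vLLM OpenAI-compatible server command used in this guide."""
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def visible_gpu_env(gpu_ids: list[int]) -> dict[str, str]:
    """Copy the current environment and restrict it to the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def build_launch(model: str, gpu_ids: list[int]):
    """One tensor-parallel rank per visible GPU."""
    return vllm_server_command(model, len(gpu_ids)), visible_gpu_env(gpu_ids)


if __name__ == "__main__":
    cmd, env = build_launch("meta-llama/Llama-3.1-405B-Instruct", [0, 1, 2, 3])
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(cmd))
    # To start the server for real:
    # import subprocess; subprocess.run(cmd, env=env, check=True)
```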

Text Generation Inference (TGI)

TGI uses --num-shard N to specify the number of GPU shards:

text-generation-launcher \
  --model-id meta-llama/Llama-3.1-405B-Instruct \
  --num-shard 4

TGI is optimized for serving and includes continuous batching, which helps amortize the multi-GPU communication overhead across larger batch sizes.

llama.cpp

llama.cpp supports heterogeneous GPU setups via --tensor-split:

llama-cli \
  --model llama-405b.Q4_K_M.gguf \
  --tensor-split 1,1,1,1 \
  -ngl 999

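The --tensor-split values define the relative proportion of the model assigned to each GPU. You can weight them manually to account for different VRAM sizes. Note that llama.cpp's default split mode places whole layers on each GPU, while --split-mode row splits individual tensors across GPUs instead. This is the most flexible option for non-standard configurations but requires manual tuning for optimal performance.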
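
One simple starting point for the weights is to make them proportional to each card's VRAM. The helper below is a hypothetical convenience, not part of llama.cpp: it reduces whole-gigabyte VRAM sizes to the smallest integer ratio and formats it for --tensor-split. Capacity-proportional weights are only a first guess, and throughput differences between cards may call for further tuning.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb_per_gpu: list[int]) -> str:
    """Format VRAM sizes as the smallest whole-number ratio, e.g. [80, 48] -> '5,3'."""
    if not vram_gb_per_gpu or any(v <= 0 for v in vram_gb_per_gpu):
        raise ValueError("expected a non-empty list of positive integers")
    divisor = reduce(gcd, vram_gb_per_gpu)
    return ",".join(str(v // divisor) for v in vram_gb_per_gpu)


if __name__ == "__main__":
    print(tensor_split_arg([80, 80, 80, 80]))  # identical cards
    print(tensor_split_arg([80, 48]))          # mixed pair
```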

ExLlamaV2

ExLlamaV2 uses --gpu_split (short form -gs) to specify VRAM allocation across GPUs. With the test_inference.py script from the ExLlamaV2 repository:

python test_inference.py \
  -m ./llama-405b-exl2 \
  -gs 80,80,80,80 \
  -p "Once upon a time,"

Values are in gigabytes per GPU. ExLlamaV2's quantization format (EXL2) is specifically designed for fast inference and handles the tensor split efficiently.

All four engines support multi-GPU inference with straightforward configuration. The choice between them is driven by serving requirements (vLLM and TGI for production APIs), ease of use (llama.cpp, or Ollama, which builds on it, for local setups), and quantization format (ExLlamaV2 for EXL2 models).


Cost Analysis

Before committing to a multi-GPU setup, it is worth pricing out the single-large-card alternatives. The comparison is often closer than expected.

| Option | Raw VRAM | Est. Cloud Cost/hr | Scaling | Best For |
|---|---|---|---|---|
| 1x H200 141GB | 141 GB | ~$4.50/hr | N/A | Most 70B models |
| 2x H100 80GB SXM | 160 GB | ~$6.50/hr | 0.92x | 70B+ at high quant |
| 1x B200 180GB | 180 GB | ~$7.00/hr | N/A | Large models, single card |
| 4x A100 80GB | 320 GB | ~$8.00/hr | 0.90x | Very large models |
| 8x H100 80GB SXM | 640 GB | ~$26.00/hr | 0.92x | 400B+ at Q4, DeepSeek R1 (Q4) |

The H200 141GB single card stands out. At ~$4.50/hr, it covers almost every practical inference workload for models up to Llama 3.1 70B and many 100B+ models at quantized precision. Two H100 80GB SXM cards at ~$6.50/hr give you slightly more raw VRAM (160GB vs 141GB) but add multi-GPU coordination overhead and cost 44% more per hour.
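
One way to compare such options is price per effective gigabyte-hour. The sketch below uses the estimated cloud prices and scaling factors from this guide. The function names are hypothetical, and actual prices vary by provider and over time.

```python
def cost_per_effective_gb_hour(price_per_hour: float, vram_per_card_gb: float,
                               n_cards: int, scaling_factor: float = 1.0) -> float:
    """Hourly price divided by effective VRAM (raw VRAM x scaling factor)."""
    effective_gb = vram_per_card_gb * n_cards * scaling_factor
    return price_per_hour / effective_gb


def percent_more(price: float, baseline: float) -> float:
    """How much more expensive `price` is than `baseline`, in percent."""
    return (price / baseline - 1) * 100


if __name__ == "__main__":
    h200 = cost_per_effective_gb_hour(4.50, 141, 1)            # single card, no discount
    h100_pair = cost_per_effective_gb_hour(6.50, 80, 2, 0.92)  # NVLink pair
    print(f"1x H200: ${h200:.4f} per effective GB-hour")
    print(f"2x H100: ${h100_pair:.4f} per effective GB-hour")
    print(f"2x H100 costs {percent_more(6.50, 4.50):.0f}% more per hour")
```

On these estimates the H100 pair costs more per hour and more per effective gigabyte, which is why the single H200 looks strong whenever the model fits.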

For DeepSeek R1 at Q4 (~380GB), neither single-card option works, and the 4x A100 80GB configuration at ~$8.00/hr falls short as well (~288GB effective). Of the options in the table, only the 8x H100 configuration holds it. This is the tier where multi-GPU becomes mandatory, not optional.

The 8x H100 configuration at ~$26.00/hr provides 640GB raw and ~589GB effective VRAM. That comfortably covers 400B+ models at Q4, but it is still below the ~810GB that Llama 3.1 405B needs at FP16. For research reproducibility, safety evaluations, or deployments where full model fidelity matters, FP16 on 400B+ models calls for higher-capacity cards, such as an 8x MI300X configuration (~1,352GB effective).

On-premise vs cloud

Cloud pricing makes the comparison easy. On-premise multi-GPU setups have different economics: high upfront cost, lower per-hour cost at sustained utilization, and operational complexity. An 8x H100 SXM server costs $300,000–500,000 to purchase and requires dedicated rack space, three-phase power, and cooling infrastructure.

For most teams, the right framework is: use cloud for development and infrequent large-model inference, evaluate on-premise only when sustained utilization justifies the capital expenditure.
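
A rough break-even estimate helps frame that decision. The sketch below divides the purchase price by the hourly cloud rate it replaces. The function names are hypothetical, the on-premise running cost defaults to zero purely as a placeholder, and the example uses the low end of the purchase range and the 8x H100 cloud estimate quoted above.

```python
HOURS_PER_YEAR = 24 * 365


def break_even_hours(purchase_cost: float, cloud_rate_per_hour: float,
                     on_prem_cost_per_hour: float = 0.0) -> float:
    """GPU-server hours of use after which buying beats renting."""
    saving_per_hour = cloud_rate_per_hour - on_prem_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("on-premise running cost must be below the cloud rate")
    return purchase_cost / saving_per_hour


def break_even_years(purchase_cost: float, cloud_rate_per_hour: float,
                     utilization: float, on_prem_cost_per_hour: float = 0.0) -> float:
    """Calendar years to break even at a given utilization (0 < utilization <= 1)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = break_even_hours(purchase_cost, cloud_rate_per_hour, on_prem_cost_per_hour)
    return hours / (HOURS_PER_YEAR * utilization)


if __name__ == "__main__":
    # ASSUMPTION: $300,000 server, $26/hr cloud equivalent, power and staff ignored.
    for utilization in (1.0, 0.5, 0.25):
        years = break_even_years(300_000, 26.0, utilization)
        print(f"{utilization:.0%} utilization: {years:.1f} years to break even")
```

Including power, cooling, and operations as a non-zero on_prem_cost_per_hour pushes the break-even point further out.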


Multi-GPU-Capable GPUs from Our Catalog

Not all GPUs support multi-GPU inference. Consumer cards (RTX series) generally do not support NVLink in multi-GPU inference configurations. The table below covers the datacenter and workstation cards from our catalog that are designed for multi-GPU setups.

| GPU | Max GPUs | Interconnect | VRAM per GPU | Scaling Factor |
|---|---|---|---|---|
| B200 180GB | 8 | NVLink (1800 GB/s) | 180 GB | 0.93x |
| B100 192GB | 8 | NVLink (1800 GB/s) | 192 GB | 0.93x |
| GB200 192GB | 2 | NVLink (1800 GB/s) | 192 GB | 0.93x |
| H200 141GB SXM | 8 | NVLink (900 GB/s) | 141 GB | 0.92x |
| H100 80GB SXM | 8 | NVLink (900 GB/s) | 80 GB | 0.92x |
| GH200 96GB | 2 | NVLink (900 GB/s) | 96 GB | 0.92x |
| A100 80GB SXM | 8 | NVLink (600 GB/s) | 80 GB | 0.90x |
| MI350X 288GB | 8 | Infinity Fabric (896 GB/s) | 288 GB | 0.90x |
| MI300X 192GB | 8 | Infinity Fabric (896 GB/s) | 192 GB | 0.88x |
| MI325X 256GB | 8 | Infinity Fabric (896 GB/s) | 256 GB | 0.88x |
| A800 80GB | 8 | NVLink (400 GB/s) | 80 GB | 0.88x |
| H800 80GB | 8 | NVLink (400 GB/s) | 80 GB | 0.88x |
| Gaudi 3 128GB | 8 | PCIe | 128 GB | 0.85x |
| MI250X 128GB | 4 | Infinity Fabric (800 GB/s) | 128 GB | 0.85x |
| Max 1550 128GB | 4 | PCIe | 128 GB | 0.80x |
| H100 PCIe 80GB | 4 | PCIe | 80 GB | 0.78x |
| H20 96GB | 8 | PCIe | 96 GB | 0.78x |
| L40S 48GB | 2 | PCIe | 48 GB | 0.75x |
| L40 48GB | 2 | PCIe | 48 GB | 0.75x |
| A40 48GB | 2 | PCIe | 48 GB | 0.75x |
| A30 24GB | 2 | NVLink (200 GB/s) | 24 GB | 0.85x |

A few observations worth highlighting.

The A800 and H800 are China-market variants of the A100 and H100 with reduced NVLink bandwidth (400 GB/s vs 600/900 GB/s). They achieve 0.88x scaling rather than the 0.90x/0.92x of their counterparts sold elsewhere, a difference that adds up across many inference calls on large models.

The H20, despite being an H100-generation chip, connects via PCIe rather than NVLink and is therefore capped at 0.78x scaling. Its advantage is high VRAM (96GB) per card at a lower price point, not interconnect quality.

The MI350X at 288GB per card with Infinity Fabric and 0.90x scaling is one of the more interesting configurations for large model inference. Eight cards provide 2,304GB raw / ~2,074GB effective, more than enough for any model at full precision.

For cards supporting more than 2 GPUs, the scaling factors in this table represent the per-GPU efficiency in a fully-populated configuration. Efficiency typically decreases slightly as you add more cards due to the increased synchronization overhead in the all-reduce collective.


Heterogeneous GPU Setups: Mostly Avoid Them

llama.cpp's --tensor-split flag makes it technically possible to run inference across GPUs of different models. In practice, this creates more problems than it solves.

The core issue is that the slowest card in a tensor-parallel setup determines the latency for every synchronization operation. If you pair an H100 with an A100, the H100 spends time idle waiting for the A100 to complete its shard and contribute to the all-reduce. The fast card's advantage is partly wasted.

There are niche cases where heterogeneous setups make sense: you have an existing server with mixed cards and need to run a model that does not fit on any single card, or you are evaluating whether to upgrade. In these cases, manual --tensor-split tuning to weight allocations toward the faster card can help. But it is an optimization project, not a straightforward deployment.
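
The straggler effect, and the benefit of weighting the split, can be shown with a toy model in which each synchronized step takes as long as the slowest card. The relative speeds below are hypothetical placeholders rather than benchmarks, and the model ignores both VRAM limits and the cost of the synchronization itself.

```python
from fractions import Fraction


def step_time(work_shares, speeds):
    """Time for one synchronized step: the slowest card sets the pace."""
    return max(Fraction(w) / Fraction(s) for w, s in zip(work_shares, speeds))


def even_shares(n_gpus: int):
    """Split the work equally, as with identical --tensor-split weights."""
    return [Fraction(1, n_gpus)] * n_gpus


def proportional_shares(speeds):
    """Give each card work in proportion to its (integer) relative speed."""
    total = sum(speeds)
    return [Fraction(s, total) for s in speeds]


if __name__ == "__main__":
    speeds = [2, 1]  # ASSUMPTION: first card is twice as fast as the second
    print("even split:        ", step_time(even_shares(2), speeds))
    print("proportional split:", step_time(proportional_shares(speeds), speeds))
```

With an even split the fast card idles for half of every step. Weighting the split removes that idle time in this model, but only if the faster card also has the VRAM to hold the larger share.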

For any greenfield infrastructure decision, buy identical cards.


When to Choose Multi-GPU

Multi-GPU is the right answer in specific, well-defined situations. It is not a universal upgrade.

Go multi-GPU when:

  • The model you need does not fit on any single available GPU at your required precision. This is the clearest case. Llama 3.1 405B at Q4 (~230GB) exceeds every single card discussed here except the MI350X 288GB, and at FP16 (~810GB) it fits on none of them.
  • You are running DeepSeek R1 at Q4 (~380GB) or at its native FP8 precision (~720GB). Same logic: the model mandates multi-GPU.
  • You are serving high-throughput batch inference and need aggregate throughput beyond what a single card delivers. Tensor parallelism helps less here than pipeline parallelism or simply running multiple single-card instances, but sometimes the constraint is the model size.

Stay single-GPU when:

  • The model fits on a single large-VRAM card with acceptable quantization. A single H200 141GB covers a vast range of use cases including all 70B models at Q4-Q8.
  • You are optimizing for latency. Single-GPU inference has lower per-token latency than tensor-parallel multi-GPU because there is no inter-GPU synchronization cost.
  • You are doing local development or experimentation. Multi-GPU setups add setup complexity and cost that is not justified for exploratory work.
  • Cost is a constraint. A single B200 180GB card at ~$7.00/hr covers more use cases than most people need, with less complexity than any multi-GPU arrangement.

The rule of thumb: reach for multi-GPU when the model does not fit, not before.


Estimating Effective VRAM for Your Configuration

Before committing to any multi-GPU setup, use this formula to estimate whether your target model will fit:

Effective VRAM = (VRAM per card × number of cards) × scaling factor

Then compare against the model's VRAM requirement at your target quantization. Our VRAM calculator does this automatically for any hardware and model combination in our catalog.

Example: You want to run DeepSeek R1 at Q4 on 4x A100 80GB SXM.

Effective VRAM = (80 GB × 4) × 0.90 = 288 GB
DeepSeek R1 Q4 requirement ≈ 380 GB

It does not fit. You need either more cards (6x A100 provides 432GB effective) or a different approach (more aggressive quantization, or switching to MI300X cards with their higher per-card VRAM).
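
The same formula can be inverted to ask how many cards a model needs. The helper below is a minimal sketch with a hypothetical name. It applies the scaling factor only when more than one card is involved, and it does not account for KV cache headroom.

```python
import math


def min_cards(required_gb: float, vram_per_card_gb: float, scaling_factor: float) -> int:
    """Smallest card count whose effective VRAM covers the requirement."""
    if required_gb <= vram_per_card_gb:
        return 1  # fits on one card, so no multi-GPU discount applies
    effective_per_card = vram_per_card_gb * scaling_factor
    return max(2, math.ceil(required_gb / effective_per_card))


if __name__ == "__main__":
    # DeepSeek R1 at Q4 (~380 GB) on the two card types mentioned above
    print("A100 80GB SXM:", min_cards(380, 80, 0.90))
    print("MI300X 192GB: ", min_cards(380, 192, 0.88))
```

In practice many engines expect the GPU count to divide the model's attention-head count evenly, so the usable configuration may be the next supported size up (often 2, 4, or 8) rather than the arithmetic minimum.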

This kind of calculation prevents expensive over-commitment. Run the numbers before provisioning hardware.


Summary

Multi-GPU LLM inference is a capability multiplier when you need it and an expensive complication when you do not.

The key variables:

  1. Interconnect quality determines scaling efficiency. NVLink SXM configs achieve 0.90–0.93x scaling. PCIe configs drop to 0.75–0.78x. That 12–18 percentage point gap is real money and real latency.

  2. Calculate effective VRAM, not raw VRAM. A 2x L40S 48GB PCIe setup gives you 96GB raw but only 72GB effective. Model fit decisions based on raw VRAM will lead to failures at runtime.

  3. Single large-VRAM cards often beat multi-GPU on cost and latency. For workloads that fit on an H200 141GB or MI350X 288GB, those single cards are better on most metrics than multi-GPU alternatives at comparable cost.

  4. All major inference engines support multi-GPU. vLLM, TGI, llama.cpp, and ExLlamaV2 all support tensor parallelism with straightforward configuration. Software is not the constraint.

  5. Multi-GPU is mandatory above certain model sizes. 400B+ models at FP16, and models that need more than ~288GB at Q4 such as DeepSeek R1, require multi-GPU. There is no single-card alternative at any price.

Use our VRAM calculator to verify any specific configuration.

Frequently Asked Questions

How many GPUs do I need to run Llama 405B?

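At Q4 quantization, Llama 405B requires approximately 230GB of VRAM. Two H100 80GB SXM cards (160GB raw, ~147GB effective with 0.92x NVLink scaling) are not enough. You need four A100 80GB SXM cards (320GB raw, ~288GB effective) or two H200 141GB cards (282GB raw, ~259GB effective). At FP16 precision (~810GB), eight H100 80GB SXM cards (640GB raw, ~589GB effective) still fall short; eight H200 141GB cards (~1,038GB effective) or eight MI300X 192GB cards (~1,352GB effective) are the kind of configuration that fits.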

Is NVLink required for multi-GPU LLM inference?

No, NVLink is not required, but it significantly affects scaling efficiency. PCIe-connected GPUs operate at 0.75x scaling (25% overhead), while NVLink-connected GPUs reach 0.92x (8% overhead). For latency-sensitive inference, NVLink makes a meaningful difference. For throughput-focused batch workloads with large batch sizes, PCIe setups are more workable.

How does tensor parallelism work for LLM inference?

Tensor parallelism splits the model's weight matrices across multiple GPUs. Each GPU holds a shard of each layer, and during the forward pass they compute their portion in parallel and synchronize partial results via all-reduce operations. This communication is why interconnect bandwidth (NVLink vs PCIe) is so critical — every generated token requires inter-GPU synchronization.

Can I mix different GPU models for multi-GPU inference?

Technically yes — llama.cpp's --tensor-split flag supports heterogeneous GPU setups. In practice, mixing GPU models creates complexity: the slower card becomes the bottleneck for every synchronization step, you must manually tune tensor splits, and the performance gap between cards wastes the faster GPU's capacity. Homogeneous setups are strongly preferred.

What's the performance penalty of multi-GPU inference?

The penalty depends on interconnect quality. NVLink setups lose roughly 7–10% of theoretical peak throughput to inter-GPU communication overhead (about 8% on H100 SXM, 10% on A100 SXM). PCIe setups lose approximately 25%. This means 2x H100 SXM GPUs over NVLink deliver roughly 1.84x the throughput of one card (not 2x), while 2x PCIe GPUs deliver around 1.5x.

Is multi-GPU inference worth the cost vs a bigger single GPU?

For most workloads, a single large-VRAM GPU is better value. An H200 141GB single card often beats 2x H100 80GB PCIe on both cost and latency. The case for multi-GPU is clear when: (1) the model does not fit on any single card (400B+ models at FP16), or (2) you need to serve high-throughput batch workloads and own the hardware already. For cloud inference, always price out single-large-card options first.