Will It Run AI
multi-gpu, rtx-4090, local-inference, nvidia, hardware-guide

2x RTX 4090 for LLMs: What You Can Run, Setup Guide & Real Performance (2026)

Can two RTX 4090s run Llama 70B or Qwen 72B? We break down the real performance of dual 4090 setups for local LLM inference — VRAM pooling, PCIe limitations, and what models you unlock.

A single RTX 4090 is the best consumer GPU money can buy for local AI inference. But 24GB of VRAM has a hard ceiling: models like Llama 3.1 70B, Qwen 2.5 72B, or DeepSeek R1 Distill 70B simply do not fit — not even at aggressive quantization. So the obvious question is: what happens if you put two of them together?

The answer is nuanced. Two 4090s unlock the entire 70B model class and deliver real, usable inference performance. But the PCIe interconnect — not the raw GPU power — is the limiting factor. This guide covers the exact scaling numbers, which models you unlock, how to configure the major inference engines, and whether the ~$3,200 dual setup makes sense versus the alternatives.

If you want the full theoretical background on multi-GPU scaling factors and interconnect technologies, see our multi-GPU inference guide. This article focuses specifically on the 2x RTX 4090 configuration.


Why Two 4090s?

The 4090 is a genuinely exceptional GPU for local AI. It has 82.6 TFLOPS of FP16 compute, 1,008 GB/s memory bandwidth, and 24GB of GDDR6X VRAM. For models up to roughly 30-34B parameters at Q4 quantization, a single 4090 is hard to beat at any consumer price point.

The problem is VRAM, and VRAM alone. Here is the math at Q4_K_M quantization — the most popular quantization level for local inference:

  • Llama 3.1 70B: ~42GB
  • Qwen 2.5 72B: ~43GB
  • DeepSeek R1 Distill 70B: ~42GB
  • Mixtral 8x7B: ~26GB (fits on one 4090, but tightly)

None of the 70B-class models fit in 24GB. You can push some of them with aggressive Q2 quantization, but at a significant quality cost. The better answer is more VRAM.

Two RTX 4090s give you 48GB of combined VRAM. That is enough for every 70B model at Q4_K_M, with headroom for context. You get all the 4090's compute advantages — the fast CUDA cores, the high bandwidth, the Ada Lovelace architecture — just with more memory to work with.

What you do not get is NVLink.


The PCIe Reality

This is the part most dual-4090 discussions gloss over. NVIDIA removed NVLink from consumer GPUs starting with the RTX 40 series (Ada Lovelace). There is no workaround. If you want NVLink on a 4090, it does not exist.

This matters because tensor parallelism — the technique that splits a model across multiple GPUs — requires constant communication between cards. Every time a token is generated, an all-reduce operation must synchronize the partial results from both GPUs before the next layer can run. The speed of that synchronization is determined entirely by the interconnect bandwidth.

The numbers:

  • PCIe Gen4 x16: ~32 GB/s per direction. This is what 2x 4090 runs on.
  • NVLink 4.0 (H100 SXM): ~900 GB/s bidirectional. This is what datacenter setups use.

PCIe Gen4 is approximately 28x slower than NVLink 4.0 for inter-GPU communication. In practice, this translates to a scaling factor of around 0.70x for the 2x 4090 configuration. You have 48GB of VRAM, but you are getting roughly 70% of the throughput that two independent 24GB GPUs could theoretically provide if there were no communication overhead.

In real terms: a 2x 4090 setup running Llama 70B at Q4 can expect approximately 25-30 tokens per second. That is perfectly usable for local inference — fast enough for interactive use and productive for summarization or coding tasks. It is not competitive with a single H100 80GB (which would do this at ~60-70 tok/s), but an H100 costs $25,000-30,000 new.

To put the overhead in perspective: offloading to system RAM is roughly 10-15x slower than even PCIe inter-GPU communication. The PCIe penalty is real but manageable. What makes 2x 4090 practical is that PCIe is still orders of magnitude faster than the alternative of running the model in CPU RAM.


What Models You Unlock

The practical question: what can you run with 48GB that you could not run with 24GB?

ModelSingle 4090 (24GB)2x 4090 (48GB)Est. tok/s (2x)
Llama 3.1 70B (Q4_K_M)No fitFits~25-30 tok/s
Qwen 2.5 72B (Q4_K_M)No fitFits~24-28 tok/s
DeepSeek R1 Distill 70B (Q4)No fitFits~22-26 tok/s
Mixtral 8x7B (Q4_K_M)Tight fit (~26GB)Comfortable~35-40 tok/s
Llama 3.1 8B (Q8)Native fitNative fit~80+ tok/s
Llama 405B (Q4)No fitNo fit (~230GB needed)N/A

A few things stand out.

The 70B-class models — Llama 70B, Qwen 72B, and the DeepSeek R1 distillations — are the main unlock. These are among the most capable open-weight models available. Llama 3.1 70B in particular is competitive with GPT-4 level performance on many benchmarks. Being able to run it locally, privately, and at ~25+ tok/s is genuinely useful.

Mixtral 8x7B technically fits on a single 4090 at Q4 (about 26GB), but it is tight — you have limited context headroom. On 2x 4090 it runs comfortably with full context.

Llama 405B is out of reach even at 48GB. At Q4 it needs approximately 230GB. That requires either cloud infrastructure or professional datacenter hardware. Use our can I run page to check specific model/hardware fit for other configurations.


Setup Guide: Software Options

Getting two 4090s working for inference is straightforward with modern inference engines. The hard part is buying the hardware and managing the heat — the software is mostly automatic.

vLLM (Recommended for throughput and API serving)

vLLM has the most polished tensor parallelism implementation and is ideal if you want to serve an OpenAI-compatible API from your dual-4090 machine:

pip install vllm
vllm serve meta-llama/Llama-3.1-70B --tensor-parallel-size 2 --dtype float16

Set --tensor-parallel-size 2 and vLLM handles the rest automatically — it detects both GPUs, splits the model layers, and manages the all-reduce communication. Continuous batching means multiple simultaneous requests amortize the PCIe overhead more efficiently than single-request inference. If you are building a local AI server that multiple people use, vLLM is the right starting point.

llama.cpp (Best for single-user local inference)

For personal use — chatting with a local model, running inference for your own scripts — llama.cpp is the most efficient option. Lower overhead than vLLM, works directly with GGUF quantized files:

./llama-server -m llama-70b-q4_k_m.gguf --tensor-split 12,12 -ngl 99

--tensor-split 12,12 distributes layers evenly across both GPUs. -ngl 99 offloads all layers to GPU (no CPU fallback). If your two 4090s have exactly equal VRAM (they will), equal splits are optimal. This launches an HTTP server on port 8080 with an OpenAI-compatible API.

For direct CLI inference without the server:

./llama-cli -m llama-70b-q4_k_m.gguf --tensor-split 12,12 -ngl 99 -p "Your prompt here"

Ollama (Easiest setup)

If you want zero configuration and do not need to tune anything, Ollama is the answer. Since v0.4, it automatically detects and uses multiple GPUs:

ollama run llama3.1:70b

That is it. Ollama handles GPU detection, tensor splitting, and layer distribution automatically using its llama.cpp backend. The trade-off is less control over configuration and slightly higher overhead than raw llama.cpp. For most people starting out with a dual-4090 setup, this is the right first step.

ExLlamaV2

For users who prefer EXL2 quantized models — which offer excellent quality-to-size ratios — ExLlamaV2 provides fine-grained GPU memory control:

python server.py --model llama-70b-exl2 --gpu-split 12,12

--gpu-split 12,12 allocates 12GB to each GPU. ExLlamaV2 is among the fastest inference engines available for EXL2 format models and handles the inter-GPU communication efficiently. If you are comfortable with Python tooling and want to optimize for maximum tokens per second, ExLlamaV2 is worth the extra setup.


Hardware Setup Tips

The software is easy. Getting two RTX 4090s physically working together takes a bit more planning.

PCIe slot selection matters. Both GPUs need full PCIe Gen4 x16 bandwidth. Check your motherboard manual — many boards have x16 physical slots that run electrically at x8 when both slots are populated. An x8 slot cuts your PCIe bandwidth in half, which is bad for inter-GPU communication. Look for boards with two x16 slots that maintain full bandwidth simultaneously, typically found on HEDT (high-end desktop) platforms like AMD TRX50 or Intel X299, or enthusiast Z790 boards that explicitly support x16/x16 bifurcation.

Power supply. Each RTX 4090 can draw up to 450W at peak load during inference. Two of them plus the rest of the system pushes total system power toward 1,100-1,200W. A quality 1,200W or 1,600W PSU is not optional — it is a safety requirement. A single 850W PSU powering two 4090s will hit its limits, potentially causing crashes or hardware damage under sustained inference workloads.

Thermal management. Two RTX 4090s generate a substantial amount of heat in an enclosed case. The GPUs run hot (up to 83°C junction temperature), and the second card typically runs hotter than the first because it receives pre-heated air. Make sure your case has strong positive or negative pressure airflow, ideally with front intake fans pushing cool air directly over both cards. Open-air test bench setups are often used by AI workstation builders specifically to address this problem. Expect higher noise levels at sustained inference workloads compared to a single-card setup.

Case and spacing. Both RTX 4090 cards are triple-slot designs. Two triple-slot GPUs occupy six expansion slots. Standard ATX mid-tower cases typically have seven slots total, leaving only one slot of clearance above the top card. Full-tower cases are strongly recommended to allow airflow between and around the cards. Some builders opt for full-ATX open-frame cases designed for multi-GPU setups.

System RAM. With two 4090s and the intention to run 70B models, 64GB of system RAM is the recommended minimum. The model lives in VRAM, but the system needs RAM for the operating system, inference engine overhead, and any additional context management. 32GB is workable but tight.


Cost Analysis

The dual-4090 setup makes the most sense in the context of alternatives. Here is where it fits:

OptionVRAMEst. CostPerformance for 70B
1x RTX 409024GB~$1,600No fit at Q4
2x RTX 409048GB eff.~$3,200~25-30 tok/s
1x RTX 509032GB~$2,000No fit at Q4 (tight at Q3)
1x RTX A6000 48GB48GB~$4,500~35-40 tok/s (single card)
1x NVIDIA L40S48GB~$8,000~40-45 tok/s (single card)
Cloud H100 (1 hr)80GB~$3.25/hr~60-70 tok/s

The case for 2x 4090 is specific but clear: if you want to run 70B-class models locally, it is the cheapest entry point by a significant margin. The RTX A6000 48GB would give you single-card simplicity at the same VRAM level but costs ~40% more and delivers comparable tokens-per-second (no PCIe overhead, but slower raw compute than two 4090s).

The RTX 5090 at 32GB is a meaningful upgrade over a single 4090 for sub-30B models, but it still cannot fit Llama 70B at Q4_K_M. For the specific use case of 70B inference, it is an expensive partial solution. If your primary goal is running the best sub-30B models as fast as possible, the 5090 is excellent. If you need 70B, you need more VRAM.

Cloud inference at $3.25/hr for an H100 is compelling for infrequent use. At typical inference workloads of a few hours per week, that adds up to $15-20/month — competitive with the amortized cost of the hardware over a few years. The case for 2x 4090 over cloud is privacy, latency (no network round-trip), and predictable cost when you are a heavy user.


Should You Buy a Second 4090?

The honest answer depends entirely on what you want to run.

A second 4090 makes strong sense if:

  • You already own one 4090 and want to run 70B models without paying for cloud compute
  • You are building a local AI workstation specifically for large model inference and want the best value per GB at the consumer tier
  • You value data privacy and want full on-premise inference for sensitive use cases
  • You are willing to manage the power and thermal overhead of a ~900W GPU system

A second 4090 probably does not make sense if:

  • You primarily run models up to 34B — a single 4090 handles everything in that range and the second card provides no benefit
  • You do not have a full-tower case, strong airflow, and a 1,200W+ PSU already — the supporting hardware costs add up
  • You are a light user — cloud inference is cheaper below a few hours of weekly use
  • You want to run 70B models but already have budget for a used H100 PCIe (which handles 70B better as a single card with no PCIe overhead)

For the right user — someone already deep in the 4090 ecosystem who needs 70B capabilities — the second card is one of the best value adds in AI hardware. The VRAM calculator can help you verify whether any specific model and quantization level will fit in your target configuration.


The Bottom Line

Two RTX 4090s form a genuine powerhouse for local AI inference. You get 48GB of effective VRAM, enough to run every 70B model at practical quantization levels, delivered by some of the fastest per-core compute available on consumer silicon.

The PCIe limitation is real — 0.70x scaling efficiency means you are not getting 2x the performance of one card, more like 1.4x. But 1.4x applied to a 4090 still means 25-30 tok/s on Llama 70B Q4, which is fast enough for every interactive workload. The configuration is well-supported by every major inference engine, and Ollama makes initial setup trivially easy.

Against the realistic alternatives — spending $4,500+ on professional single-GPU cards or paying cloud rates for H100 access — 2x 4090 at ~$3,200 is a defensible choice for any serious local AI user who has landed on 70B models as their target tier.

Check specific model compatibility on our model pages or use the VRAM calculator to see exactly what fits in your setup.

Frequently Asked Questions

Can two RTX 4090s run Llama 70B?

Yes. Two RTX 4090s provide 48GB of combined VRAM. Llama 3.1 70B at Q4_K_M quantization requires approximately 42GB, which fits comfortably. With PCIe tensor parallelism, you can expect around 25-30 tokens per second — usable for local inference, though slower than a dedicated datacenter GPU with NVLink.

Do RTX 4090s support NVLink?

No. NVIDIA removed NVLink from consumer GPUs starting with the Ada Lovelace generation (RTX 40 series). Multi-GPU 4090 setups communicate over PCIe Gen4, which provides approximately 32 GB/s per direction — around 28x less bandwidth than H100 SXM NVLink. This results in roughly 0.70x scaling efficiency (30% overhead) for tensor-parallel inference.

Is 2x RTX 4090 worth it vs buying one RTX 5090?

For AI inference specifically, yes — 2x RTX 4090 (48GB combined) is significantly more capable than a single RTX 5090 (32GB) for running large models. The RTX 5090 cannot fit Llama 70B at Q4 without offloading, while 2x 4090 handles the entire 70B class comfortably. The trade-off is higher power draw (900W combined vs ~600W) and more complex setup.

What's the performance penalty of 2x RTX 4090 vs a single 80GB GPU?

About 30% throughput loss due to PCIe communication overhead versus a single-card setup. A single NVIDIA H100 80GB PCIe would be roughly 40% faster at the same model due to no inter-GPU communication cost and better memory bandwidth per GB. However, the H100 costs around $30,000 new — approximately 19x the price of a used 4090.

Can I use 3x or 4x RTX 4090?

Technically yes with vLLM or llama.cpp. However, PCIe bandwidth becomes a severe bottleneck beyond 2 GPUs, and diminishing returns are significant. Most consumer motherboards also lack three or four x16 PCIe slots with adequate spacing for triple-slot 4090 cards. For most users, 2x 4090 is the practical maximum. Beyond that, consider professional-grade alternatives.

Which is better for 2x 4090: vLLM or llama.cpp?

vLLM is better if you are serving an API or need to handle multiple concurrent requests — its continuous batching amortizes the PCIe overhead across larger batch sizes. llama.cpp is better for single-user local inference: lower overhead, easier to configure, and supports GGUF quantized models directly. Ollama (which uses llama.cpp underneath) is the easiest entry point for beginners.