Ollama Multi-GPU Support: Official Behavior, Layer Split & Setup (2026)
How Ollama officially handles multiple GPUs: auto-detection, layer distribution, num_gpu parameter, CUDA_VISIBLE_DEVICES, and real 2× RTX 4090 / 2× RTX 3090 performance numbers for Llama 70B and Qwen 3.5.
If you already use Ollama to run local AI models, adding a second GPU is one of the best upgrades you can make. It doubles your available VRAM, unlocking 70B-class models that simply will not fit on a single consumer card — and the setup is remarkably simple. Ollama handles multi-GPU automatically.
This guide explains how Ollama's multi-GPU support works, what models become possible with two GPUs, and the handful of environment variables worth knowing about. Whether you have two RTX 4090s, a mixed pair, or are planning your first second-GPU purchase, here is everything you need to know.
Ollama + Multi-GPU: The Easiest Path
Ollama is the most popular tool for running large language models locally, and for good reason: it is simple. Download it, run ollama run llama3.1:8b, and you have a working local AI within minutes. No Docker containers, no Python environments, no configuration files.
The same simplicity extends to multi-GPU. Since Ollama v0.4, released in late 2024, it automatically detects every NVIDIA or AMD GPU in your system and distributes model layers across all of them. You do not configure anything. You do not pass flags. You just run your model, and Ollama figures out the rest.
Under the hood, Ollama uses llama.cpp's tensor splitting mechanism, which partitions model layers across available GPUs and handles the inter-GPU communication during inference. This is the same approach that powers more advanced multi-GPU tools — Ollama just does all the configuration for you automatically.
The result: if a 70B model requires 40GB of VRAM and you have two 24GB GPUs, Ollama sees 48GB of combined VRAM and loads the model. No extra steps required.
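The arithmetic behind that claim can be sketched in a few lines of shell. The helper name combined_vram_fits is hypothetical, and the check is deliberately naive: Ollama's own scheduler also reserves memory for the KV cache and runtime overhead, which a plain sum ignores.

```shell
# combined_vram_fits REQUIRED_GB GPU_GB [GPU_GB ...]
# Prints "fits" if the summed GPU memory covers the requirement,
# otherwise "too large". Hypothetical helper, not part of Ollama.
combined_vram_fits() {
    required="$1"
    shift
    echo "$required $*" | awk '{
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        if (total >= $1) {
            print "fits"
        } else {
            print "too large"
        }
    }'
}

combined_vram_fits 40 24 24   # 40GB model, two 24GB cards
combined_vram_fits 40 24      # same model, one 24GB card
```

Treat a "fits" result as a necessary condition rather than a guarantee, since context length adds to the real memory requirement.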
How Ollama Uses Multiple GPUs
When Ollama starts up, it enumerates all available GPUs using CUDA (for NVIDIA) or ROCm (for AMD). It then makes allocation decisions based on each GPU's available VRAM.
Here is what happens when you run a large model:
- GPU detection: Ollama scans for all NVIDIA and AMD GPUs via their respective runtime APIs. All detected GPUs are candidates for use.
- Layer distribution: The model's transformer layers are split across the available GPUs. Ollama uses llama.cpp's --tensor-split mechanism internally, proportionally weighted by each GPU's VRAM capacity.
- Forward pass: During inference, each GPU processes its assigned layers. When the computation moves between GPUs (layer boundaries), intermediate activations are transferred via PCIe. This inter-GPU communication is the main source of overhead compared to a single large GPU.
- Mixed GPU support: If your GPUs have different VRAM amounts (say, a 24GB card and a 16GB card), Ollama weights the layer split accordingly — roughly 60% to the larger card and 40% to the smaller one.
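The proportional weighting described in the list above can be illustrated with a small calculation. This is a sketch of the idea only, not Ollama's actual allocator: split_layers is a hypothetical helper, and 80 layers is used as an example count for a 70B-class model.

```shell
# split_layers TOTAL_LAYERS VRAM_GB [VRAM_GB ...]
# Prints one "GPU<i>: <n> layers" line per card, proportional to VRAM.
split_layers() {
    layers="$1"
    shift
    echo "$layers $*" | awk '{
        total = 0
        for (i = 2; i <= NF; i++) total += $i
        assigned = 0
        for (i = 2; i <= NF; i++) {
            if (i == NF) {
                # The last GPU takes the remainder so the counts sum to the total.
                n = $1 - assigned
            } else {
                n = int($1 * $i / total + 0.5)
            }
            assigned += n
            printf "GPU%d: %d layers\n", i - 2, n
        }
    }'
}

split_layers 80 24 16   # a 24GB card paired with a 16GB card
```

With a 24GB and a 16GB card the split comes out at 48 and 32 layers, which is the 60/40 weighting mentioned above.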
The key limitation is PCIe bandwidth. Consumer GPUs communicate over PCIe rather than NVLink, and PCIe provides roughly 14x less bandwidth than NVLink 4.0. This is why the multi-GPU inference guide cites a 0.75x scaling factor for PCIe setups: you capture about 75% of the theoretical 2x throughput, not the full double. But that still represents a massive improvement over running models partially on CPU, which Ollama falls back to when VRAM is insufficient.
What Models You Can Run with 2 GPUs
The practical question: what gets unlocked? Here is a summary of common two-GPU configurations and the models they enable.
| Your Setup | Total VRAM | Models You Unlock |
|---|---|---|
| 2x RTX 4090 (24GB each) | 48 GB | Llama 70B (Q4), Qwen 72B, DeepSeek R1 Distill 70B |
| 2x RTX 3090 (24GB each) | 48 GB | Same as 2x 4090 but ~30% slower decode |
| 2x RTX 4080 Super (16GB each) | 32 GB | Llama 70B (Q2), Mixtral 8x7B (Q4) |
| RTX 4090 + RTX 3090 (mixed) | 48 GB | Llama 70B (Q4) — Ollama handles mixed GPUs |
| 2x RTX 5090 (32GB each) | 64 GB | Llama 70B (Q5/Q6), larger context windows |
A few things to note about this table:
The 4090 vs 3090 difference at 2x is meaningful. Two RTX 3090s give you identical VRAM to two RTX 4090s, so you can load the same models. But the 4090 has higher memory bandwidth (1008 GB/s vs 936 GB/s, about 8% more) and considerably more compute per layer, which adds up to a real performance gap on 70B inference.
Mixed GPUs work but come with trade-offs. A 4090 + 3090 combo gives you 48GB combined and can run Llama 70B at Q4. The layer split will be roughly equal (both are 24GB cards), so you will not lose much to imbalance. Performance will be close to the weaker card's throughput.
The RTX 5090 changes the calculus. At 32GB per card, two RTX 5090s give 64GB. That is enough for Llama 70B at Q5 or Q6 (higher quality than Q4) and for substantially larger context windows than a Q4 run on 48GB. Q8 weights for a 70B model come to roughly 75GB, so they still do not fit.
Use the VRAM calculator to check specific model + quantization combinations for your setup.
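For a quick estimate without the calculator, the size of the quantized weights is roughly parameters times bits per weight divided by eight. The sketch below assumes 4.5 bits per weight, which is what the Q4_0 format stores (4-bit values plus a 16-bit scale per 32-weight block); other quantizations use different averages, and the hypothetical weights_gb helper ignores KV cache and runtime overhead.

```shell
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).
# Excludes KV cache and runtime overhead, so real usage is higher.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 70 4.5   # 70B parameters at an assumed 4.5 bits per weight
weights_gb 8 4.5    # 8B parameters at the same quantization
```

Leave a few gigabytes of headroom on top of the result for context, more if you plan to use long prompts.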
Setup Guide
Basic usage (it just works)
The most important thing to know is that Ollama requires no special configuration for multi-GPU. Install it, make sure your GPU drivers are up to date, and run your model.
# Install Ollama (if not installed)
curl -fsSL https://ollama.com/install.sh | sh
# Just run — Ollama auto-detects multiple GPUs
ollama run llama3.1:70b
On first run, Ollama downloads the model (this takes a while for 70B — expect 20–40GB depending on quantization). Once loaded, it will automatically split layers across your GPUs.
Verify GPU detection
Before running a large model, confirm Ollama sees all your GPUs.
# Show loaded models and their GPU/CPU split (run after loading a model)
ollama ps
# Check NVIDIA GPUs directly — look for memory usage on both cards
nvidia-smi
When a model is loaded across two GPUs, nvidia-smi should show significant memory usage on both cards. If only one card shows usage, see the troubleshooting section below.
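That check can be scripted. The sketch below flags any GPU holding less than a chosen amount of memory while a model is loaded; idle_gpus is a hypothetical helper and the 1024 MiB threshold is an arbitrary assumption you may want to adjust.

```shell
# idle_gpus THRESHOLD_MIB
# Reads "index, memory.used" CSV rows on stdin and prints the index of
# every GPU using less than THRESHOLD_MIB. No output means all cards hold data.
idle_gpus() {
    awk -F', *' -v min="$1" '$2 + 0 < min { print $1 }'
}

# Live check, run while a model is loaded (skipped if nvidia-smi is absent).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits \
        | idle_gpus 1024
fi
```

Any index it prints belongs to a card that is not carrying a meaningful share of the model.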
Environment variables for control
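Ollama reads a few environment variables that let you override its automatic GPU decisions, plus one model parameter, num_gpu, that controls layer offload.
# Use specific GPUs only (e.g., GPU 0 and 1, skip GPU 2)
CUDA_VISIBLE_DEVICES=0,1 ollama serve
# Force CPU-only mode for one request (num_gpu = number of layers offloaded to GPU)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:70b", "prompt": "Hello", "options": {"num_gpu": 0}}'
# Request maximum GPU offload (any value above the model's layer count)
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:70b", "prompt": "Hello", "options": {"num_gpu": 99}}'
CUDA_VISIBLE_DEVICES is the most useful of these. If you have three GPUs but want Ollama to only use two of them (leaving the third for another task), set this before starting ollama serve. The variable is read by the server process, so setting it on ollama run has no effect.
The num_gpu parameter controls how many layers are sent to GPU versus staying on CPU. It is a model option rather than an environment variable: pass it in the API options as shown, set it with /set parameter num_gpu 0 inside an interactive ollama run session, or add PARAMETER num_gpu 0 to a Modelfile. Left unset, Ollama offloads as many layers as it estimates will fit. Setting it to 0 is a useful debugging tool to compare CPU-only performance.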
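On Linux installs where Ollama runs as a systemd service, a variable typed in your shell never reaches the server, so the setting has to live in the unit. A minimal sketch follows, assuming the service is named ollama as in the Linux installer; gpu_dropin and the gpus.conf file name are hypothetical choices.

```shell
# gpu_dropin DEVICE_LIST
# Prints a systemd drop-in that pins the Ollama service to the listed GPUs.
gpu_dropin() {
    printf '[Service]\nEnvironment="CUDA_VISIBLE_DEVICES=%s"\n' "$1"
}

gpu_dropin "0,1"

# To apply it (commented out because it needs root):
#   sudo mkdir -p /etc/systemd/system/ollama.service.d
#   gpu_dropin "0,1" | sudo tee /etc/systemd/system/ollama.service.d/gpus.conf
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```

Running sudo systemctl edit ollama.service and pasting the two printed lines achieves the same result interactively.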
AMD GPU setup
Ollama supports AMD GPUs through ROCm. Multi-GPU works with RDNA3 and CDNA3 generation cards.
# Some AMD GPU models may require a GFX version override
# Replace 11.0.0 with your GPU's actual GFX version
# Set it on the server process, then run the model from another terminal
HSA_OVERRIDE_GFX_VERSION=11.0.0 ollama serve
ollama run llama3.1:70b
Check AMD's ROCm compatibility matrix for your specific GPU model. RX 7900 XTX and RX 7900 XT are the most commonly tested consumer AMD cards with Ollama.
Performance Expectations
Multi-GPU on consumer PCIe has real overhead. Setting realistic expectations prevents disappointment.
The scaling factor: Consumer PCIe multi-GPU typically achieves about 0.70–0.75x efficiency per GPU. Two cards theoretically offer 2x capacity, but you capture roughly 1.4–1.5x the throughput of a single card. The remaining loss is inter-GPU communication over PCIe.
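The scaling arithmetic is simple enough to check yourself. The sketch below multiplies the GPU count by the per-GPU efficiency quoted above; effective_speedup is a hypothetical helper, and the efficiency figures are the article's estimates, not measurements.

```shell
# effective_speedup NUM_GPUS EFFICIENCY
# Prints the expected throughput multiple relative to one card.
effective_speedup() {
    awk -v n="$1" -v e="$2" 'BEGIN { printf "%.2fx\n", n * e }'
}

effective_speedup 2 0.70   # pessimistic end of the range
effective_speedup 2 0.75   # optimistic end of the range
```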
Compared to CPU offloading: If a 70B model partially spills to system RAM, you might see 2–5 tok/s. With two GPUs keeping the model entirely in VRAM, you get 20–30 tok/s. That roughly 10x improvement over CPU offloading is what makes the overhead from multi-GPU coordination irrelevant in practice.
Prefill vs decode: Prompt processing (reading the prompt, the "prefill" phase) scales better across GPUs than token generation (the "decode" phase). If you process long context windows, multi-GPU provides a more proportional speedup there.
Approximate benchmarks for reference:
| Model | 1x RTX 4090 | 2x RTX 4090 | Notes |
|---|---|---|---|
| Llama 8B (Q4) | ~85 tok/s | ~85 tok/s | Fits on one GPU — no benefit from second |
| Llama 70B (Q4) | Does not fit | ~25–30 tok/s | Unlocked by combining VRAM |
| Mixtral 8x7B (Q4) | ~35 tok/s (tight) | ~45 tok/s | Tight on one 24GB card; a second GPU removes the memory pressure |
The Llama 8B row is a reminder that multi-GPU only helps when the model actually uses both cards. For models that fit comfortably in a single GPU's VRAM, adding a second GPU does nothing for inference speed.
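To measure prefill and decode speed on your own hardware, ollama run accepts a --verbose flag that prints timing statistics after each response. The sketch below pulls the two rates out of that output. parse_rates is a hypothetical helper, and it assumes the statistics use the "prompt eval rate" and "eval rate" labels and arrive on stderr.

```shell
# parse_rates
# Reads `ollama run --verbose` statistics on stdin and prints the
# prefill ("prompt eval rate") and decode ("eval rate") speeds.
parse_rates() {
    awk '
        /^[ \t]*prompt eval rate:/ { printf "prefill: %s tok/s\n", $4 }
        /^[ \t]*eval rate:/        { printf "decode: %s tok/s\n", $3 }
    '
}

# Usage (commented out because it needs a running Ollama and a pulled model):
#   ollama run llama3.1:70b --verbose "Explain PCIe in one paragraph." 2>&1 | parse_rates
```

Run the same prompt a few times and compare the decode figure across configurations, since the first run includes model load time in its total duration.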
Troubleshooting
"Model too large" error even with two GPUs: Verify your GPUs are being detected. Run nvidia-smi and check both cards appear. Then confirm available VRAM — if other applications are using GPU memory, Ollama may see less than the full 48GB. Also try a smaller quantization: Q3_K_M uses roughly 25% less VRAM than Q4_K_M.
"Only using one GPU": The most common cause is that the Ollama server started before the second GPU was available (or before drivers were fully loaded). Restart the Ollama service:
# Linux systemd
sudo systemctl restart ollama
# macOS
# Quit Ollama from the menu bar and relaunch
Confirm with nvidia-smi that both cards show memory usage while a model is loaded.
Slow performance with two GPUs: Check your PCIe slot configuration. Both GPUs should be in x16 slots — if one is running at x8 due to motherboard limitations, bandwidth is halved for that card. Check with:
nvidia-smi --query-gpu=index,name,pcie.link.width.current,pcie.link.width.max --format=csv
You want a current link width of 16 for both GPUs. If one shows 8, you have a slot bandwidth limitation that will affect performance.
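If you have more than two cards, a small filter makes the narrow links easier to spot. This sketch uses nvidia-smi's pcie.link.width.current query field; narrow_links is a hypothetical helper, and 16 is simply the width this guide recommends.

```shell
# narrow_links WANTED_WIDTH
# Reads "index, current_width" CSV rows on stdin and reports every GPU
# whose PCIe link is narrower than WANTED_WIDTH lanes.
narrow_links() {
    awk -F', *' -v want="$1" '$2 + 0 < want { printf "GPU %s: x%d\n", $1, $2 }'
}

# Live check (skipped if nvidia-smi is absent).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,pcie.link.width.current --format=csv,noheader \
        | narrow_links 16
fi
```

No output means every card reports at least the requested width.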
Mixed GPU performance is worse than expected: Ollama handles mixed GPU configurations, but performance is bounded by the slower card. If you pair an RTX 4090 with a GTX 1080 Ti (11GB), the 1080 Ti becomes a bottleneck for every layer boundary crossing, and the old card's memory bandwidth limits the whole pipeline. The closer your two GPUs are in generation and memory bandwidth, the better the multi-GPU performance.
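A rough way to see why the slower card dominates is a bandwidth-only bound on decode speed. The sketch assumes each generated token reads all weights once and that the GPUs work one after another with no overlap, so it gives a ceiling rather than a prediction. decode_bound is a hypothetical helper, and the bandwidth figures are the 1008 GB/s and 936 GB/s quoted earlier.

```shell
# decode_bound GB:BANDWIDTH [GB:BANDWIDTH ...]
# Each argument is "<GB of weights on this GPU>:<memory bandwidth in GB/s>".
# Prints an upper bound on decode speed under the assumptions above.
decode_bound() {
    echo "$*" | awk '{
        seconds = 0
        for (i = 1; i <= NF; i++) {
            split($i, part, ":")
            seconds += part[1] / part[2]   # time to read this GPU share once
        }
        printf "%.1f tok/s\n", 1 / seconds
    }'
}

decode_bound 20:1008 20:1008   # 40GB model across two RTX 4090s
decode_bound 20:1008 20:936    # RTX 4090 + RTX 3090
```

Because the per-GPU times add up, every gigabyte placed on a low-bandwidth card costs more than the same gigabyte on the fast one.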
Ollama vs vLLM for Multi-GPU
Ollama is not the only option for multi-GPU local inference. vLLM is a common alternative, especially for anyone running inference as an API service.
| Feature | Ollama | vLLM |
|---|---|---|
| Setup complexity | Zero config | Requires --tensor-parallel-size N |
| Multi-GPU support | Automatic | Explicit, more control |
| Best for | Local / personal use | API serving, production |
| GGUF model support | Full | Limited |
| Continuous batching | No | Yes |
| Mixed GPU configurations | Supported | Not supported |
| AMD GPU support | Yes (ROCm) | Yes (ROCm) |
| Quantized model inference | Excellent (GGUF) | Good (AWQ, GPTQ) |
The rule of thumb: use Ollama for personal, local, interactive use. It is simpler, supports GGUF quantization (the most widely used format for consumer inference), and handles multi-GPU transparently. Use vLLM if you are serving an API to multiple users simultaneously, need continuous batching for throughput, or are running in a production environment where per-request latency under load matters.
Most Ollama users are running one model for personal use and do not need the complexity of vLLM. If that is you, Ollama multi-GPU is the right choice.
Wrapping Up
Adding a second GPU to your Ollama setup is one of the highest-leverage hardware upgrades available for local AI enthusiasts. The jump from one 24GB GPU to two means the difference between running Llama 8B and running Llama 70B — a qualitative improvement in model capability, not just a speed bump.
The setup is genuinely simple: install both GPUs, update your drivers, and Ollama handles the rest. The 25–30% PCIe overhead is a real cost, but it is negligible compared to the alternative of CPU offloading or simply not being able to run the model at all.
For hardware recommendations and VRAM calculations specific to your setup, use the VRAM calculator or browse the hardware pages for the RTX 4090 and RTX 5090. And if you are weighing Ollama against more advanced multi-GPU tools, the main multi-GPU inference guide covers the full technical picture including NVLink scaling, tensor parallelism, and datacenter GPU configurations.