Local AI Build Glossary in 2026 - PCIe, ECC, NVLink, Offload, Bifurcation & More
The practical glossary for local AI builders. Understand PCIe lanes, ECC RAM, bifurcation, RDIMM vs UDIMM, NVLink, offload, tensor parallelism, KV cache, VRAM headroom, and the hardware terms that matter when building an AI workstation.
If you spend enough time in local AI forums, you eventually see the same pattern:
people understand the GPU model number, but not the platform vocabulary around it.
That is where expensive mistakes happen.
This glossary is meant to fix that. It is not an academic encyclopedia. It is the working vocabulary you need to read hardware discussions, plan a build, and understand why some AI rigs scale well while others become expensive dead ends.
If you want the applied version, pair this with How to Build a Local AI Workstation in 2026 and PCIe Lanes for Local AI Explained.
A
Active parameters
In a Mixture-of-Experts model, active parameters are the subset of total weights used for a given token. Memory sizing still depends on total model storage, but speed is influenced more by active parameters.
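To make the split between storage and speed concrete, here is a minimal sketch. The parameter counts describe a hypothetical MoE model and the 2 bytes per weight is an assumption (FP16 or BF16 storage); none of these figures come from a specific real model.

```python
# Hypothetical MoE model: all figures below are illustrative assumptions.
TOTAL_PARAMS = 47e9      # every expert plus shared layers; all of it must be stored
ACTIVE_PARAMS = 13e9     # weights actually used to process one token
BYTES_PER_WEIGHT = 2     # assumed FP16 or BF16 storage


def gib(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


storage_gib = gib(TOTAL_PARAMS * BYTES_PER_WEIGHT)
read_per_token_gib = gib(ACTIVE_PARAMS * BYTES_PER_WEIGHT)

print(f"Weights to store:       {storage_gib:.1f} GiB")
print(f"Weights used per token: {read_per_token_gib:.1f} GiB")
```

The first number decides whether the model fits at all. The second is the one that tracks per-token work.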
Airflow
Not a buzzword. In AI boxes, airflow determines whether your GPUs sustain clocks or turn into throttling heaters.
B
Bifurcation
The ability to split a PCIe link into smaller links:
- x16 into x8/x8
- x16 into x4/x4/x4/x4
It matters for dual-GPU designs and quad-M.2 carrier cards.
Blower GPU
A GPU cooler that pushes hot air out of the chassis rather than dumping it back into the case. Less pretty than open-air coolers, often much more practical in dense workstation or server builds.
Bottleneck
Any component that limits usable performance. In local AI that can be:
- VRAM
- system RAM
- PCIe topology
- storage
- thermals
- software
C
Chipset-attached lanes
Expansion paths that reach the CPU through the chipset uplink. Useful, but not equivalent to direct CPU lanes.
Chipset uplink
The connection between CPU and chipset. Devices behind the chipset share this path.
Context window
The maximum number of tokens, prompt plus generated output, that the model can attend to at once. Bigger contexts require more KV cache and therefore more memory.
CPU-attached lanes
The premium PCIe lanes. This is where you want GPUs, key NVMe drives, and serious networking whenever possible.
CPU offload
Running part of a model or workload from system RAM because VRAM is not enough. It makes oversized models runnable, but almost always slower.
D
Decode TPS
Tokens per second during generation. One of the most useful real-world speed metrics for local LLM inference.
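A minimal sketch of the calculation, assuming you have timed the prefill and decode phases separately. The function names and the example timings are hypothetical, not taken from any particular runtime.

```python
def decode_tps(generated_tokens: int, decode_seconds: float) -> float:
    """Tokens per second for the generation phase only."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return generated_tokens / decode_seconds


def end_to_end_tps(generated_tokens: int, prefill_seconds: float,
                   decode_seconds: float) -> float:
    """Tokens per second including the time spent processing the prompt."""
    return generated_tokens / (prefill_seconds + decode_seconds)


# Hypothetical run: 2 s of prefill, then 256 tokens generated in 10 s.
print(decode_tps(256, 10.0))           # generation speed only
print(end_to_end_tps(256, 2.0, 10.0))  # what the user actually waited for
```

Keeping the two numbers apart matters because a long prompt can make a fast generator feel slow.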
DMI
Direct Media Interface, Intel's CPU-to-chipset link. It is Intel's version of the chipset uplink, and every mainstream platform has an equivalent you need to care about.
Dual-slot / triple-slot
How physically thick a GPU is. This matters for airflow and whether multiple GPUs can actually fit.
E
ECC
Error-Correcting Code memory. It detects and corrects certain memory errors, typically correcting single-bit errors and detecting double-bit ones. The more memory, uptime, and money involved, the more sensible ECC becomes.
Effective VRAM
The memory capacity you can actually use once topology, interconnect overhead, and real software behavior are considered. Raw VRAM on the spec sheet and usable capacity are not always the same thing: two 24 GB cards do not automatically behave like one 48 GB card.
F
Fine-tuning
Updating model weights on your hardware instead of just running inference. Usually much more demanding than inference in memory, storage, and workflow complexity.
FP16 / BF16 / FP8
Floating-point precision formats for weights and activations. FP16 and BF16 use 2 bytes per value; FP8 uses 1 byte. Higher precision uses more memory and can preserve more quality. Lower precision saves memory and often improves speed if the software and hardware support it well.
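The memory side of this trade-off is simple arithmetic. The sketch below assumes a hypothetical 8-billion-parameter model and counts weights only; KV cache and runtime overhead come on top.

```python
# Storage size of one value in each format, in bytes.
BYTES_PER_VALUE = {"FP16": 2, "BF16": 2, "FP8": 1}


def weight_bytes(num_params: int, fmt: str) -> int:
    """Bytes needed to store the weights alone in the given format."""
    return num_params * BYTES_PER_VALUE[fmt]


params = 8_000_000_000  # hypothetical model size, for illustration only
for fmt in BYTES_PER_VALUE:
    print(f"{fmt}: {weight_bytes(params, fmt) / 1e9:.0f} GB of weights")
```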
G
Gen4 / Gen5 PCIe
PCIe generations. Each generation roughly doubles per-lane bandwidth, so a Gen5 link delivers about twice the bandwidth of a Gen4 link at the same lane width.
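A rough sketch of the numbers, using the per-lane transfer rates (16 GT/s for Gen4, 32 GT/s for Gen5) and the 128b/130b line encoding. It ignores packet and protocol overhead, so treat the results as upper bounds per direction rather than measured throughput.

```python
# Raw transfer rate per lane in GT/s for each PCIe generation.
TRANSFER_RATE_GT_S = {4: 16.0, 5: 32.0}
# Gen4 and Gen5 both use 128b/130b encoding: 128 payload bits per 130 sent.
ENCODING_EFFICIENCY = 128 / 130


def link_bandwidth_gb_s(gen: int, lanes: int) -> float:
    """Approximate one-direction bandwidth of a link in GB/s."""
    per_lane_gb_s = TRANSFER_RATE_GT_S[gen] * ENCODING_EFFICIENCY / 8
    return per_lane_gb_s * lanes


for gen, lanes in [(4, 16), (5, 8), (5, 16), (4, 4)]:
    print(f"Gen{gen} x{lanes}: {link_bandwidth_gb_s(gen, lanes):.1f} GB/s")
```

One practical consequence: a Gen5 x8 link carries about the same bandwidth as a Gen4 x16 link, provided both the slot and the card actually run at Gen5.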
GPU topology
The practical layout of how GPUs connect to CPU, switches, and each other. This matters much more in multi-GPU builds than people expect.
H
Headroom
The safety margin left after weights, runtime overhead, KV cache, and other allocations. Systems with zero headroom tend to be miserable even when they technically boot.
HEDT
High-End Desktop. The class between mainstream desktop and server. In AI, this usually means workstation-oriented platforms with more lanes and better expansion than normal consumer boards.
I
Inference
Running a model to generate outputs. Most local AI users care about inference, not training.
Interconnect
The connection used between GPUs in multi-GPU systems. Examples include PCIe, NVLink, and Infinity Fabric.
K
KV cache
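The cache of attention keys and values kept for earlier tokens during autoregressive generation, so they do not have to be recomputed at every step. It grows with context length and with the number of sequences served at once, and at long contexts it can take a significant share of memory.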
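A common back-of-the-envelope estimate multiplies out the cache dimensions. The sketch below assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token; the model shape in the example is hypothetical.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2,
                   sequences: int = 1) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    keys_and_values = 2  # one key tensor plus one value tensor
    return (keys_and_values * layers * kv_heads * head_dim
            * tokens * bytes_per_value * sequences)


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, FP16.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8192)
print(f"{size / 2**30:.2f} GiB for one 8192-token sequence")  # 1.00 GiB

# Serving four such sequences at once quadruples the cache.
print(kv_cache_bytes(32, 8, 128, 8192, sequences=4) / 2**30, "GiB")
```

Real runtimes may quantize or page the cache, so use this as a planning number, not an exact one.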
L
Lane
A PCIe lane is one bidirectional serial link, the basic unit of PCIe bandwidth. Links bundle lanes into widths such as x1, x4, x8, and x16, depending on what the device needs and what the platform can provide.
Latency
How long an operation takes to respond. In local AI, high latency can come from CPU offload, slow storage, bad topology, or underpowered hardware even when average throughput looks acceptable.
M
Mechanical x16 slot
A slot physically shaped like x16. It might still be electrically wired as x8 or x4. This distinction matters.
Mixture of Experts
Model architecture where only part of the network is active per token. Great for efficiency, but total weights still need to be stored.
Motherboard lane map
The practical truth table of where slots and M.2 ports really connect. One of the most important documents for AI builders and one of the least-read.
N
NIC
Network Interface Card. Relevant once you care about 10GbE, 25GbE, 100GbE, shared storage, or remote serving.
NVLink
NVIDIA's dedicated GPU interconnect. Important for some multi-GPU setups because it offers more GPU-to-GPU bandwidth than plain PCIe and so reduces communication overhead.
NVMe
Fast PCIe-attached storage. This is what you want for model storage, cache, and scratch work on serious local AI systems.
O
OCuLink
A compact PCIe cabling standard that appears in some storage and external expansion setups. Useful in some workstation and server designs.
Offload
Moving part of the work out of GPU memory, usually into system RAM or CPU execution. Helpful, slower, sometimes necessary.
Open-air GPU
A card that dumps a lot of heat back into the case. Great in some single-GPU desktops. Much less ideal in dense multi-GPU setups.
P
PCIe switch
A chip that fans one upstream PCIe link out to several downstream devices. Great for connectivity flexibility. Not a magic way to create more upstream bandwidth.
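The bandwidth point can be put in numbers with an oversubscription ratio. This sketch assumes all links run at the same PCIe generation, that host traffic is shared in proportion to link width, and that only traffic to and from the host is counted; the function names are hypothetical.

```python
def oversubscription(upstream_lanes: int, downstream_lanes: list[int]) -> float:
    """Total downstream lane width divided by upstream lane width."""
    if upstream_lanes <= 0 or not downstream_lanes:
        raise ValueError("need a positive upstream width and at least one device")
    return sum(downstream_lanes) / upstream_lanes


def worst_case_share(upstream_lanes: int, downstream_lanes: list[int]) -> float:
    """Fraction of its own link rate each device gets when all are busy at once."""
    return min(1.0, 1.0 / oversubscription(upstream_lanes, downstream_lanes))


# Four x16 GPUs behind a switch with a single x16 upstream link.
gpus = [16, 16, 16, 16]
print(oversubscription(16, gpus))   # 4.0
print(worst_case_share(16, gpus))   # 0.25
```

Each GPU still negotiates a full x16 link to the switch. The limit only shows up when several of them move data to or from the host at the same time.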
PCIe topology
How all PCIe devices connect in the real machine:
- direct to CPU
- through chipset
- through a switch
- through risers or carriers
This is one of the most important concepts in multi-GPU AI builds.
Prefill
The stage where the model processes the prompt before generation begins. Important for latency and large-context workloads.
Q
Quantization
Compressing model weights to lower precision to reduce memory use and often improve speed.
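As an illustration of the idea, here is a minimal sketch of symmetric absmax quantization to 4-bit integers. This is a deliberately simple technique chosen for clarity; real formats such as the GGUF K-quants use more elaborate block-wise schemes, and the function names here are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int = 4):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        scale = 1.0                               # all-zero input: any scale works
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.07]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
print(q)         # small integers in the range -7..7
print(restored)  # close to the originals, but not identical
```

The memory saving comes from storing the small integers plus one scale instead of a full float per weight. The rounding error is the quality cost.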
Q4_K_M
A common GGUF quantization level, at roughly 4 to 5 bits per weight, used as a practical quality-memory baseline in local inference discussions.
R
RDIMM
Registered DIMM memory, commonly used in workstation and server platforms. Usually the cleaner path for large ECC builds.
Riser
An adapter or cable that lets you reposition PCIe devices. Useful, but not a replacement for sane topology.
S
Scratch storage
Temporary fast storage for generated outputs, datasets, model conversions, and cache. Often ignored until a machine starts doing real work.
Serving
Running models as a persistent service, usually local APIs or shared endpoints. This pushes builds toward more RAM, better topology, and more reliability.
Slot spacing
The physical space between expansion slots. Multi-GPU builds can fail mechanically or thermally even when lane counts look fine on paper.
T
Tensor parallelism
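Splitting the weight matrices inside each layer across multiple GPUs, so every GPU computes a slice of every layer and the partial results are exchanged over the interconnect. One of the core software strategies behind multi-GPU inference.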
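A toy sketch of the idea in plain Python, with lists standing in for GPU tensors. It shows a column-wise split of one weight matrix, which is one common way to shard a layer; the helper names are hypothetical and no real GPU framework is involved.

```python
def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """Compute x @ w, where w has len(x) rows and any number of columns."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Cut the weight matrix into equal column blocks, one per device."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must divide evenly across devices")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


x = [1.0, 2.0, 3.0]
w = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0]]

shards = split_columns(w, parts=2)                  # one shard per "GPU"
partials = [matvec(x, shard) for shard in shards]   # computed independently
combined = [v for part in partials for v in part]   # gather step over the interconnect

print(combined == matvec(x, w))  # True: same result as the unsplit layer
```

The gather step is where interconnect bandwidth matters: it happens for every sharded layer, on every token.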
Thermal budget
The heat your system can actually remove in sustained use. AI workloads can be much more punishing than bursty gaming loads.
Topology
The real map of how devices connect. In multi-GPU systems, topology is often more important than small headline differences in CPU model.
U
UDIMM
Unbuffered DIMM memory. Common in consumer systems. ECC UDIMM exists, but support is platform- and board-dependent.
Unified memory
Shared CPU/GPU memory architecture, most prominently associated with Apple Silicon. Useful for capacity, but different from discrete GPU VRAM behavior.
V
VRAM
GPU memory. The first hard wall most local AI users run into.
VRAM headroom
How much memory remains after weights, runtime overhead, KV cache, and working buffers. This determines whether a fit is comfortable or fragile.
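The budget can be written down as a simple subtraction. The figures in this sketch are hypothetical placeholders; substitute your own measurements for weights, KV cache, runtime overhead, and working buffers.

```python
def vram_headroom_gib(vram_gib: float, weights_gib: float, kv_cache_gib: float,
                      runtime_overhead_gib: float, buffers_gib: float) -> float:
    """Memory left over after all known allocations; negative means it will not fit."""
    used = weights_gib + kv_cache_gib + runtime_overhead_gib + buffers_gib
    return vram_gib - used


# Hypothetical 24 GiB card with illustrative allocation sizes.
headroom = vram_headroom_gib(
    vram_gib=24.0,
    weights_gib=18.0,
    kv_cache_gib=3.0,
    runtime_overhead_gib=1.5,
    buffers_gib=1.0,
)
print(f"Headroom: {headroom:.1f} GiB ({headroom / 24.0:.0%} of the card)")
```

In this example the model fits, but a longer prompt or a second concurrent request could push it over the edge. That is the fragile kind of fit.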
X
x4 / x8 / x16
Lane widths for PCIe links. Bigger widths provide more bandwidth, but the usefulness depends on workload and overall topology.
Final Take
You do not need to memorize every term here.
You do need to understand enough of them that the next time someone says:
- "just add a second GPU"
- "that board has four x16 slots"
- "ECC is overkill"
- "the chipset lanes are fine"
you can tell whether they are describing a real AI workstation or a future headache.
For the practical build side, continue with Best Local AI Builds in 2026.