Will It Run AI

Local AI Build Glossary in 2026 - PCIe, ECC, NVLink, Offload, Bifurcation & More

The practical glossary for local AI builders. Understand PCIe lanes, ECC RAM, bifurcation, RDIMM vs UDIMM, NVLink, offload, tensor parallelism, KV cache, VRAM headroom, and the hardware terms that matter when building an AI workstation.

If you spend enough time in local AI forums, you eventually see the same pattern:

People understand the GPU model number, but not the platform vocabulary around it.

That is where expensive mistakes happen.

This glossary is meant to fix that. It is not an academic encyclopedia. It is the working vocabulary you need to read hardware discussions, plan a build, and understand why some AI rigs scale well while others become expensive dead ends.

If you want the applied version, pair this with How to Build a Local AI Workstation in 2026 and PCIe Lanes for Local AI Explained.


A

Active parameters

In a Mixture-of-Experts model, active parameters are the subset of total weights used for a given token. Memory sizing still depends on the total parameter count, because every expert has to be stored, but per-token speed tracks the active count more closely.
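
A rough sketch of why the two numbers matter separately. The 100B total / 12B active figures and the 4 bits per weight are illustrative assumptions, not a real model, and `moe_profile` is a hypothetical helper:

```python
def moe_profile(total_params_b: float, active_params_b: float,
                bits_per_weight: float) -> dict:
    """Estimate weight storage and the share of weights read per token.

    Parameter counts are in billions; results are in GB (1e9 bytes).
    """
    bytes_per_weight = bits_per_weight / 8
    return {
        # Every expert must be resident somewhere (VRAM, RAM, or disk).
        "weights_gb": total_params_b * bytes_per_weight,
        # Only the active subset is read for each generated token.
        "read_per_token_gb": active_params_b * bytes_per_weight,
        "active_fraction": active_params_b / total_params_b,
    }


profile = moe_profile(total_params_b=100, active_params_b=12, bits_per_weight=4)
print(profile)
```

The storage figure is what you size memory against; the per-token figure is closer to what drives generation speed.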

Airflow

Not a buzzword. In AI boxes, airflow determines whether your GPUs sustain clocks or turn into throttling heaters.


B

Bifurcation

The ability to split a PCIe link into smaller links:

  • x16 into x8/x8
  • x16 into x4/x4/x4/x4

It matters for dual-GPU designs and quad-M.2 carrier cards.
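
As a minimal sketch, a split is only plausible if every child link has a legal width and the lanes add up to the parent link. `is_valid_split` is a hypothetical helper; real boards expose only the specific modes their firmware offers, so passing this check does not mean your motherboard supports the split:

```python
ALLOWED_WIDTHS = (1, 2, 4, 8, 16)


def is_valid_split(parent_width: int, split: list[int]) -> bool:
    """True if every child link has a legal width and the lanes add up exactly."""
    return all(w in ALLOWED_WIDTHS for w in split) and sum(split) == parent_width


print(is_valid_split(16, [8, 8]))        # dual-GPU style split
print(is_valid_split(16, [4, 4, 4, 4]))  # quad-M.2 carrier style split
print(is_valid_split(16, [8, 8, 4]))     # asks for 20 lanes from a 16-lane slot
```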

Blower GPU

A GPU cooler that pushes hot air out of the chassis rather than dumping it back into the case. Less pretty than open-air coolers, often much more practical in dense workstation or server builds.

Bottleneck

Any component that limits usable performance. In local AI that can be:

  • VRAM
  • system RAM
  • PCIe topology
  • storage
  • thermals
  • software

C

Chipset-attached lanes

Expansion paths that reach the CPU through the chipset uplink. Useful, but not equivalent to direct CPU lanes.

Chipset uplink

The connection between CPU and chipset. Devices behind the chipset share this path.

Context window

The maximum token history the model can use. Bigger contexts require more KV cache and therefore more memory.

CPU-attached lanes

The premium PCIe lanes. This is where you want GPUs, key NVMe drives, and serious networking whenever possible.

CPU offload

Running part of a model or workload in system RAM because VRAM is not enough. It makes oversized models runnable, but almost always slower.
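
Some runtimes expose this split as a count of layers kept on the GPU (llama.cpp's `--n-gpu-layers` option is one example). A simplified sketch of the arithmetic, assuming every layer is the same size and ignoring KV cache and runtime overhead; `plan_offload` and the numbers are hypothetical:

```python
def plan_offload(n_layers: int, layer_gb: float,
                 vram_budget_gb: float) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu) for a simple per-layer split."""
    if layer_gb <= 0:
        raise ValueError("layer_gb must be positive")
    # Keep as many whole layers on the GPU as the budget allows.
    gpu_layers = int(vram_budget_gb // layer_gb)
    gpu_layers = max(0, min(n_layers, gpu_layers))
    return gpu_layers, n_layers - gpu_layers


# Hypothetical model: 80 layers of 0.5 GB each, 20 GB of VRAM set aside for weights.
print(plan_offload(n_layers=80, layer_gb=0.5, vram_budget_gb=20))
```

Every layer that lands in the second number runs from system RAM, which is where the slowdown comes from.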


D

Decode TPS

Tokens per second during generation. One of the most useful real-world speed metrics for local LLM inference.
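
If your runtime does not report it, decode TPS can be derived from three timestamps. This is a minimal sketch with made-up numbers; `inference_speeds` is a hypothetical helper, and it assumes you can record when the first generated token arrives:

```python
def inference_speeds(prompt_tokens: int, generated_tokens: int,
                     t_start: float, t_first_token: float, t_end: float) -> dict:
    """Split a run into prefill and decode phases and report tokens/second."""
    prefill_s = t_first_token - t_start
    decode_s = t_end - t_first_token
    if prefill_s <= 0 or decode_s <= 0:
        raise ValueError("timestamps must be strictly increasing")
    return {
        "prefill_tps": prompt_tokens / prefill_s,
        # The first generated token arrives at the end of prefill, so only
        # the remaining tokens are attributed to the decode phase.
        "decode_tps": (generated_tokens - 1) / decode_s,
    }


speeds = inference_speeds(2000, 257, t_start=0.0, t_first_token=4.0, t_end=20.0)
print(speeds)
```

Keeping the two phases separate matters because a long prompt can make a run feel slow even when decode speed is fine.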

DMI

Intel's CPU-to-chipset link. It is Intel's version of the chipset uplink: a shared path that everything behind the chipset has to cross, and one you need to care about on any mainstream platform.

Dual-slot / triple-slot

How physically thick a GPU is. This matters for airflow and whether multiple GPUs can actually fit.


E

ECC

Error-Correcting Code memory. It detects and corrects certain memory errors; typical ECC modules correct single-bit errors and detect double-bit errors. The more memory, uptime, and money involved, the more sensible ECC becomes.

Effective VRAM

The memory capacity you can actually use once topology, interconnect overhead, and real software behavior are considered. Raw VRAM totals and usable capacity are not always the same thing: two 24 GB cards, for example, rarely behave like a single 48 GB card.


F

Fine-tuning

Updating model weights on your hardware instead of just running inference. Usually much more demanding than inference in memory, storage, and workflow complexity.

FP16 / BF16 / FP8

Floating-point precision formats. FP16 and BF16 both store each value in 16 bits (2 bytes), while FP8 uses 8 bits (1 byte). Wider formats use more memory and can preserve more quality. Narrower formats save memory and often improve speed if the software and hardware support them well.


G

Gen4 / Gen5 PCIe

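PCIe generations. Each generation roughly doubles per-lane bandwidth: Gen4 signals at 16 GT/s per lane and Gen5 at 32 GT/s. At the same lane width Gen5 gives about twice the bandwidth of Gen4, so a Gen5 x8 link carries about as much as a Gen4 x16 link.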
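
The theoretical numbers are easy to compute. This sketch uses the published signaling rates and the 128b/130b line encoding shared by Gen3 through Gen5; it gives one-direction link bandwidth before protocol overhead, so real transfers land lower. `pcie_gb_per_s` is a hypothetical helper name:

```python
# Signaling rate per lane in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
# Gen3 through Gen5 send 128 payload bits in every 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s for a link."""
    bits_per_s_per_lane = GT_PER_S[gen] * ENCODING_EFFICIENCY
    return bits_per_s_per_lane / 8 * lanes


for gen, lanes in [(4, 16), (5, 16), (5, 8), (4, 4)]:
    print(f"Gen{gen} x{lanes}: {pcie_gb_per_s(gen, lanes):.1f} GB/s")
```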

GPU topology

The practical layout of how GPUs connect to CPU, switches, and each other. This matters much more in multi-GPU builds than people expect.


H

Headroom

The safety margin left after weights, runtime overhead, KV cache, and other allocations. Systems with zero headroom tend to be miserable even when they technically boot.

HEDT

High-End Desktop. The class between mainstream desktop and server. In AI, this usually means workstation-oriented platforms with more lanes and better expansion than normal consumer boards.


I

Inference

Running a model to generate outputs. Most local AI users care about inference, not training.

Interconnect

The connection used between GPUs in multi-GPU systems. Examples include PCIe, NVLink, and Infinity Fabric.


K

KV cache

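The cache of attention keys and values for tokens already processed, kept so autoregressive generation does not recompute them. Its size grows linearly with context length and with the number of sequences served at once, so large contexts and multi-user serving can increase KV cache memory significantly.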
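
For a standard transformer, the size can be estimated from the model shape. The sketch below uses the usual formula (keys plus values, per layer, per KV head); the 32-layer, 8-KV-head, 128-dimension shape is a hypothetical example, and runtimes that quantize or page the cache will differ:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Estimate KV cache size in bytes for a standard transformer."""
    # The leading 2 accounts for storing both keys and values.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


# Hypothetical model shape with an FP16 cache (2 bytes per element).
for context in (8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                          context_len=context)
    print(f"{context} tokens: {size / 2**30:.1f} GiB")
```

Quadrupling the context quadruples the cache, which is why a model that fits at short context can fail at long context.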


L

Lane

A PCIe lane is one full-duplex serial connection: a transmit pair and a receive pair. Links bundle lanes into widths, most commonly x1, x4, x8, or x16, depending on what the device needs and what the platform can provide.

Latency

How long an operation takes to respond. In local AI, high latency can come from CPU offload, slow storage, bad topology, or underpowered hardware even when average throughput looks acceptable.


M

Mechanical x16 slot

A slot physically shaped like x16. It might still be electrically wired as x8 or x4. This distinction matters.
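
On Linux you can check what a card actually negotiated. This sketch reads the `current_link_width` and `max_link_width` attributes that the kernel exposes under `/sys/bus/pci/devices`; the helper names are hypothetical, and it assumes a Linux system with sysfs mounted (elsewhere it simply finds nothing):

```python
from pathlib import Path


def read_link_widths(sysfs_root: str = "/sys/bus/pci/devices") -> dict[str, tuple[int, int]]:
    """Map PCI address -> (current_link_width, max_link_width) where available."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not Linux, or sysfs is not mounted
    widths = {}
    for dev in sorted(root.iterdir()):
        try:
            current = int((dev / "current_link_width").read_text().strip())
            maximum = int((dev / "max_link_width").read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable for this device
        widths[dev.name] = (current, maximum)
    return widths


def running_narrow(widths: dict[str, tuple[int, int]]) -> list[str]:
    """Devices linked at fewer lanes than the device itself supports."""
    return [addr for addr, (current, maximum) in widths.items()
            if 0 < current < maximum]


for addr in running_narrow(read_link_widths()):
    print(addr, "is linked at fewer lanes than it supports")
```

A narrow link is not proof of a badly wired slot; a riser, bifurcation setting, or shared slot can cause it too. Treat it as a prompt to check the motherboard lane map.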

Mixture of Experts

Model architecture where only part of the network is active per token. Great for efficiency, but total weights still need to be stored.

Motherboard lane map

The practical truth table of where slots and M.2 ports really connect. One of the most important documents for AI builders and one of the least-read.


N

NIC

Network Interface Card. Relevant once you care about 10GbE, 25GbE, 100GbE, shared storage, or remote serving.

NVLink

NVIDIA's dedicated GPU interconnect. Important for some multi-GPU setups because it gives GPUs a faster direct path to each other and so reduces communication overhead versus plain PCIe.

NVMe

Fast PCIe-attached storage. This is what you want for model storage, cache, and scratch work on serious local AI systems.


O

OCuLink

A compact PCIe cabling standard that appears in some storage and external expansion setups. Useful in some workstation and server designs.

Offload

Moving part of the work out of GPU memory, usually into system RAM or CPU execution. Helpful, slower, sometimes necessary.

Open-air GPU

A card that dumps a lot of heat back into the case. Great in some single-GPU desktops. Much less ideal in dense multi-GPU setups.


P

PCIe switch

A chip that fans one upstream PCIe link out to several downstream devices. Great for connectivity flexibility. Not a magic way to create more upstream bandwidth.
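
A simplified model of the upstream limit, assuming all devices talk to the host at once and the switch shares the uplink in proportion to demand (a simplification; real arbitration is more involved). `per_device_share` and the bandwidth figures are hypothetical:

```python
def per_device_share(upstream_gb_s: float,
                     downstream_gb_s: list[float]) -> list[float]:
    """Host bandwidth each device gets when all of them are busy at once."""
    demand = sum(downstream_gb_s)
    if demand <= upstream_gb_s:
        return list(downstream_gb_s)  # the uplink is not the limit
    scale = upstream_gb_s / demand
    return [d * scale for d in downstream_gb_s]


# Four devices that could each use 32 GB/s, behind a single 32 GB/s uplink.
print(per_device_share(32, [32, 32, 32, 32]))
```

Each device still sees a full-width link on its own side of the switch. The squeeze only shows up when several of them hit the host at the same time.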

PCIe topology

How all PCIe devices connect in the real machine:

  • direct to CPU
  • through chipset
  • through a switch
  • through risers or carriers

This is one of the most important concepts in multi-GPU AI builds.

Prefill

The stage where the model processes the prompt before generation begins. Important for latency and large-context workloads.


Q

Quantization

Compressing model weights to lower precision to reduce memory use and often improve speed.
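
The memory effect is simple arithmetic. This is a minimal sketch for a hypothetical 70B-parameter model; `weight_gb` is an illustrative helper, and real quantized formats store extra scaling data, so their effective bits per weight run somewhat above the nominal figure:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B at {label}: {weight_gb(70, bits):.0f} GB")
```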

Q4_K_M

A common GGUF quantization level, a roughly 4-bit format from llama.cpp's k-quant family, used as a practical quality-memory baseline in local inference discussions.


R

RDIMM

Registered DIMM memory, commonly used in workstation and server platforms. Usually the cleaner path for large ECC builds.

Riser

An adapter or cable that lets you reposition PCIe devices. Useful, but not a replacement for sane topology.


S

Scratch storage

Temporary fast storage for generated outputs, datasets, model conversions, and cache. Often ignored until a machine starts doing real work.

Serving

Running models as a persistent service, usually local APIs or shared endpoints. This pushes builds toward more RAM, better topology, and more reliability.

Slot spacing

The physical space between expansion slots. Multi-GPU builds can fail mechanically or thermally even when lane counts look fine on paper.


T

Tensor parallelism

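Splitting the computation inside each layer across multiple GPUs, typically by sharding weight matrices so every GPU computes part of every layer and the partial results are combined over the interconnect. One of the core software strategies behind multi-GPU inference.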
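
A toy illustration of the idea in plain Python, with lists standing in for GPUs. It shows a column-wise split of one weight matrix, which is only one of several sharding schemes, and the final concatenation stands in for the gather step a real framework would run over the interconnect. All function names are hypothetical:

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, n_shards):
    """Split w column-wise into equal shards, one per simulated GPU."""
    n_cols = len(w[0])
    if n_cols % n_shards:
        raise ValueError("columns must divide evenly across shards")
    step = n_cols // n_shards
    return [[row[s * step:(s + 1) * step] for row in w]
            for s in range(n_shards)]


def tensor_parallel_matmul(x, w, n_shards):
    # Each simulated GPU multiplies the full input by its own column shard.
    partials = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # Concatenating the partial outputs stands in for the gather step.
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(tensor_parallel_matmul(x, w, n_shards=2))
print(matmul(x, w))
```

Both calls print the same vector. The gather step happens for every sharded layer on every token, which is why interconnect and topology matter so much for this strategy.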

Thermal budget

The heat your system can actually remove in sustained use. AI workloads can be much more punishing than bursty gaming loads.

Topology

The real map of how devices connect. In multi-GPU systems, topology is often more important than small headline differences in CPU model.


U

UDIMM

Unbuffered DIMM memory. Common in consumer systems. ECC UDIMM exists, but support is platform- and board-dependent.

Unified memory

Shared CPU/GPU memory architecture, most prominently associated with Apple Silicon. Useful for capacity, but different from discrete GPU VRAM behavior.


V

VRAM

GPU memory. The first hard wall most local AI users run into.

VRAM headroom

How much memory remains after weights, runtime overhead, KV cache, and working buffers. This determines whether a fit is comfortable or fragile.
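
A back-of-the-envelope check, as a sketch. The 10% comfort threshold is an arbitrary assumption for illustration, not an established rule, and the helper names and memory figures are hypothetical:

```python
def vram_headroom_gb(vram_gb: float, weights_gb: float,
                     kv_cache_gb: float, overhead_gb: float) -> float:
    """Memory left after weights, KV cache, and runtime overhead."""
    return vram_gb - (weights_gb + kv_cache_gb + overhead_gb)


def fit_verdict(headroom_gb: float, vram_gb: float,
                min_fraction: float = 0.10) -> str:
    """Classify a fit; min_fraction is an assumed comfort margin."""
    if headroom_gb < 0:
        return "does not fit"
    if headroom_gb < vram_gb * min_fraction:
        return "fragile"
    return "comfortable"


# Hypothetical 24 GB card: 19 GB weights, 3 GB KV cache, 1.5 GB overhead.
headroom = vram_headroom_gb(24, weights_gb=19, kv_cache_gb=3, overhead_gb=1.5)
print(headroom, fit_verdict(headroom, 24))
```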


X

x4 / x8 / x16

Lane widths for PCIe links. Bigger widths provide more bandwidth, but the usefulness depends on workload and overall topology.


Final Take

You do not need to memorize every term here.

You do need to understand enough of them that the next time someone says:

  • "just add a second GPU"
  • "that board has four x16 slots"
  • "ECC is overkill"
  • "the chipset lanes are fine"

you can tell whether they are describing a real AI workstation or a future headache.

For the practical build side, continue with Best Local AI Builds in 2026.

Frequently Asked Questions

What is the most important local AI hardware term to understand first?

PCIe lanes. They determine whether your machine is really a one-GPU desktop, a proper workstation, or a compromised multi-device build.

What is the difference between ECC, RDIMM, and UDIMM?

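ECC means memory with error correction. UDIMM and RDIMM describe how the module is built: a UDIMM is unbuffered, while an RDIMM places a register between the memory controller and the DRAM chips. Consumer systems mostly use non-ECC UDIMM, with ECC UDIMM supported on some platforms and boards. Workstation and server platforms commonly use ECC RDIMM.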

What does CPU offload mean?

CPU offload means part of the model or workload spills from GPU memory into system RAM. It lets oversized models run, but usually with a large performance penalty compared with full GPU fit.

What is PCIe bifurcation?

It is the platform's ability to split one bigger PCIe link, such as x16, into smaller links like x8/x8 or x4/x4/x4/x4. It matters for multi-GPU and multi-NVMe expansion.

What is tensor parallelism?

Tensor parallelism splits model computation across multiple GPUs. It is one of the standard ways to run larger models across several cards.

What is VRAM headroom?

VRAM headroom is the memory margin left after loading weights, runtime overhead, KV cache, and other working data. A model that technically fits with no headroom can still be a bad user experience.