Will It Run AI

Methodology

How WillItRunAI estimates fit, performance, and recommendations.

Every estimate on this site is produced by a deterministic engine — no “vibes,” no hidden LLM calls, no opaque scores. This page documents the actual formulas, thresholds, and weights so you can verify or challenge any result.

1. How It Works

The platform follows a four-stage pipeline that runs entirely in your browser or on our API — never on a third-party LLM.

Detect

Identify your GPU, VRAM, memory bandwidth, and compute capability via WebGPU/WebGL or manual selection.

Estimate Fit

Calculate whether each model's weights, KV cache, and runtime overhead fit your available memory.

Score

Combine model quality, fit status, inference speed, and VRAM efficiency into a 0–100 compatibility score, then map to an S/A/B/C/D/F grade.

Recommend

Rank all eligible models by a weighted composite of quality, fit, utilization, speed, specialization, and freshness.

2. Hardware Detection

When you visit the calculator, the site attempts to auto-detect your hardware through the browser's GPU APIs. Detection proceeds through three tiers of decreasing confidence:

WebGPU

The preferred path. Returns the exact adapter name and VRAM limits. Available in Chrome 113+, Edge 113+, and recent Firefox Nightly builds.

WebGL

Fallback when WebGPU is unavailable. Reads the WEBGL_debug_renderer_info extension for the renderer string. VRAM is not exposed, so it is inferred from the catalog.

Manual Override

You can always select your hardware from our curated catalog. This is the only option on Safari and older browsers, and it guarantees exact specs.

Apple Silicon

Macs with Apple Silicon use unified memory shared between CPU and GPU. The engine applies a 0.72 utilization factor to the total unified memory (e.g. a machine with 36 GB of unified memory yields ~25.9 GB usable for inference) and uses a 0.15 headroom ratio instead of the 0.10 used for discrete GPUs. Offload severity is also much lower (0.30, versus 0.85 to 0.95 for PCIe or Thunderbolt) because the “offload” stays within the same memory fabric.
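As a rough illustration of the arithmetic above, the following Python sketch applies the 0.72 utilization factor and picks the headroom ratio. The function names are hypothetical, and treating a discrete GPU's full VRAM as usable is an assumption of this sketch rather than something the page states.

```python
# Sketch only: names are illustrative, not the engine's actual API.
APPLE_UTILIZATION = 0.72   # utilization factor quoted in the text
HEADROOM_APPLE = 0.15      # headroom ratio for Apple Silicon
HEADROOM_DISCRETE = 0.10   # headroom ratio for discrete GPUs


def usable_memory_gb(total_gb: float, is_apple_silicon: bool) -> float:
    """Memory treated as available for inference.

    Assumption: a discrete GPU exposes its full VRAM (factor 1.0).
    """
    return total_gb * APPLE_UTILIZATION if is_apple_silicon else total_gb


def headroom_ratio(is_apple_silicon: bool) -> float:
    """Share of total memory kept free as headroom."""
    return HEADROOM_APPLE if is_apple_silicon else HEADROOM_DISCRETE


print(round(usable_memory_gb(36, True), 1))  # prints 25.9
```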

3. VRAM Estimation

The total memory required by a model is the sum of four components:

requiredMemory = weightsMemory + runtimeOverhead + kvCache + headroom

weightsMemory  = paramsTotalB  * gbPerBillion(quant)
kvCache        = paramsB * 0.030 * (contextTokens / 4096)   // fallback
                 ≈ paramsB * 0.06  at 8192 context
headroom       = totalAvailableMemory * headroomRatio
                 (Apple Silicon: 0.15 | Discrete GPU: 0.10)

When architecture metadata is available (numLayers, numKvHeads, headDim), the engine uses an exact per-layer KV cache calculation instead of the parameter-scaled approximation shown above.
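The page does not spell out the per-layer calculation, so the sketch below uses the standard transformer KV cache size formula (keys and values stored for every layer, KV head, and token) as a stand-in, next to the documented fallback. The function names, the 2-byte FP16 cache element, and the example architecture values are assumptions made for illustration.

```python
def kv_cache_exact_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                         context_tokens: int, bytes_per_element: int = 2) -> int:
    """Standard KV cache size: 2x (keys + values) per layer, head, and token.

    Assumption: FP16 cache entries (2 bytes); the engine may differ.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_tokens * bytes_per_element)


def kv_cache_fallback_gb(params_b: float, context_tokens: int) -> float:
    """Parameter-scaled fallback from the formula block above."""
    return params_b * 0.030 * (context_tokens / 4096)


# Hypothetical 8B-class model with grouped-query attention at 8192 tokens.
exact_gb = kv_cache_exact_bytes(32, 8, 128, 8192) / 1e9
fallback_gb = kv_cache_fallback_gb(8, 8192)
print(round(exact_gb, 2), round(fallback_gb, 2))  # prints 1.07 0.48
```

For this hypothetical configuration the two estimates differ noticeably, which suggests why the architecture-based path is preferred when metadata is available.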

The gbPerBillion constant comes from the quantization profile. These values include GGUF metadata and structural overhead, not just raw weight bytes:

Quant    Bits  GB / billion params  Quality    Throughput multiplier
Q2_K     2     0.34                 Low        1.16
Q3_K_S   3     0.46                 Low        1.12
Q4_K_M   4     0.58                 Medium     1.08
Q5_K_M   5     0.68                 High       1.01
Q6_K     6     0.78                 High       0.97
Q8_0     8     1.02                 Very High  0.90
F16      16    2.00                 Maximum    0.76

Example: Llama 3.1 70B at Q4_K_M needs roughly 70 * 0.58 = 40.6 GB just for weights. Add runtime overhead (~1 GB), KV cache, and headroom to get the total required memory.
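The full sum can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the KV cache uses the parameter-scaled fallback, and the 1 GB runtime overhead is simply the rough figure from the example.

```python
# GB per billion params, copied from the quantization table.
GB_PER_BILLION = {
    "Q2_K": 0.34, "Q3_K_S": 0.46, "Q4_K_M": 0.58, "Q5_K_M": 0.68,
    "Q6_K": 0.78, "Q8_0": 1.02, "F16": 2.00,
}


def required_memory_gb(params_b: float, quant: str, context_tokens: int,
                       total_available_gb: float,
                       is_apple_silicon: bool = False,
                       runtime_overhead_gb: float = 1.0) -> float:
    """weights + runtime overhead + KV cache (fallback) + headroom."""
    weights = params_b * GB_PER_BILLION[quant]
    kv_cache = params_b * 0.030 * (context_tokens / 4096)
    headroom = total_available_gb * (0.15 if is_apple_silicon else 0.10)
    return weights + runtime_overhead_gb + kv_cache + headroom


# 70B at Q4_K_M, 8192 context, on a hypothetical 48 GB discrete setup:
# 40.6 (weights) + 1.0 (overhead) + 4.2 (KV) + 4.8 (headroom)
print(round(required_memory_gb(70, "Q4_K_M", 8192, 48.0), 1))  # prints 50.6
```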

4. Fit Classification

The engine computes a ratio = requiredMemory / availableMemory and maps it to one of five fit statuses:

Status      Ratio threshold  Label              Meaning
native_fit  ≤ 0.82           Runs well          Comfortable headroom; full context, no offload.
tight_fit   ≤ 0.95           Tight fit          Fits, but headroom is limited. Reducing context may help.
hybrid_fit  ≤ 1.08           Runs with offload  Exceeds VRAM slightly. Some layers offloaded to system RAM.
unsafe_fit  ≤ 1.20           Very compromised   Significant offload required. Performance will degrade substantially.
no_fit      > 1.20           Too heavy          Model does not fit this hardware even with offload.
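The threshold table translates into a simple ordered lookup. The sketch below is one way to express it in Python; the function name is hypothetical and only the thresholds come from the table.

```python
# Upper ratio bound for each status, in ascending order (from the table).
FIT_THRESHOLDS = [
    ("native_fit", 0.82),
    ("tight_fit", 0.95),
    ("hybrid_fit", 1.08),
    ("unsafe_fit", 1.20),
]


def classify_fit(required_gb: float, available_gb: float) -> str:
    """Map requiredMemory / availableMemory to a fit status."""
    ratio = required_gb / available_gb
    for status, upper_bound in FIT_THRESHOLDS:
        if ratio <= upper_bound:
            return status
    return "no_fit"


print(classify_fit(50.6, 48.0))  # ratio ~1.05, prints hybrid_fit
```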

Offload validation

When the fit status is hybrid_fit or unsafe_fit, the engine checks whether the host system has enough RAM to absorb the offloaded layers. It reserves 40% of system RAM for the OS and compares the remainder against the offloaded weight portion. If host RAM is insufficient, the status is downgraded (hybrid → unsafe, unsafe → no_fit).
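A minimal sketch of this check, assuming the offloaded weight portion is already known in GB (how the engine derives that figure is not described here, so it is passed in as a parameter). The function and constant names are hypothetical.

```python
OS_RESERVE = 0.40  # share of system RAM reserved for the OS (from the text)

# Downgrade path when host RAM cannot absorb the offloaded layers.
DOWNGRADE = {"hybrid_fit": "unsafe_fit", "unsafe_fit": "no_fit"}


def validate_offload(status: str, offloaded_gb: float,
                     system_ram_gb: float) -> str:
    """Return the status, downgraded if host RAM is insufficient."""
    if status not in DOWNGRADE:
        return status  # only offloading statuses are checked
    usable_host_ram = system_ram_gb * (1.0 - OS_RESERVE)
    if offloaded_gb <= usable_host_ram:
        return status
    return DOWNGRADE[status]


print(validate_offload("hybrid_fit", 12.0, 16.0))  # prints unsafe_fit
```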

Offload severity by interconnect

The performance penalty of offloading depends on how fast data moves between GPU and host memory. The engine assigns an offload severity factor per interconnect type:

Apple Silicon (unified memory): 0.30
PCIe 4.0:                      0.85
PCIe 3.0:                      0.92
Thunderbolt / USB:             0.95
Default (unknown):             0.85

fitPenalty = max(0.15, 1.0 - offloadFraction * offloadSeverity)
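To make the effect of the severity factor concrete, here is a small Python sketch of the fitPenalty formula. The dictionary keys and function name are made up for illustration; the numeric values are the ones listed above.

```python
OFFLOAD_SEVERITY = {
    "apple_unified": 0.30,
    "pcie4": 0.85,
    "pcie3": 0.92,
    "thunderbolt_usb": 0.95,
}
DEFAULT_SEVERITY = 0.85  # used when the interconnect is unknown


def fit_penalty(offload_fraction: float, interconnect: str = "unknown") -> float:
    """Throughput multiplier after offload, floored at 0.15."""
    severity = OFFLOAD_SEVERITY.get(interconnect, DEFAULT_SEVERITY)
    return max(0.15, 1.0 - offload_fraction * severity)


# Offloading 20% of the model costs far less on unified memory.
print(round(fit_penalty(0.2, "apple_unified"), 2))  # prints 0.94
print(round(fit_penalty(0.2, "pcie4"), 2))          # prints 0.83
```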

5. Performance Estimation

The engine estimates two key throughput metrics: decode TPS (tokens per second during generation) and prefill TPS (tokens per second during prompt ingestion).

Decode throughput (memory-bandwidth-bound)

decodeTps = max(2,
  (bandwidth / throughputWeightsMemory)
  * efficiency
  * fitPenalty
  * quant.throughputMultiplier
)

bandwidth              = memoryBandwidthGbps * multiGpuScaling
throughputWeightsMemory = throughputParamBasis * gbPerBillion(quant)
efficiency              = Apple Silicon: 0.62 | Discrete GPU: 0.70

For Mixture-of-Experts models, throughputParamBasis is not simply total params. The engine uses activeParams * 1.1 + (totalParams - activeParams) * 0.3 to account for expert routing overhead while discounting inactive parameters that are not touched per token.
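The decode formula and the MoE parameter basis can be sketched together as follows. Function names are hypothetical, and the bandwidth and parameter counts in the example are illustrative inputs rather than catalog values.

```python
# quant -> (GB per billion params, throughput multiplier), from the table.
QUANT = {
    "Q2_K": (0.34, 1.16), "Q3_K_S": (0.46, 1.12), "Q4_K_M": (0.58, 1.08),
    "Q5_K_M": (0.68, 1.01), "Q6_K": (0.78, 0.97), "Q8_0": (1.02, 0.90),
    "F16": (2.00, 0.76),
}


def moe_param_basis(total_params_b: float, active_params_b: float) -> float:
    """Throughput basis for MoE: active params weighted 1.1, inactive 0.3."""
    return active_params_b * 1.1 + (total_params_b - active_params_b) * 0.3


def decode_tps(bandwidth_gbs: float, param_basis_b: float, quant: str,
               is_apple_silicon: bool = False, fit_penalty: float = 1.0,
               multi_gpu_scaling: float = 1.0) -> float:
    """Bandwidth-bound decode estimate with a floor of 2 tokens/s."""
    gb_per_billion, throughput_mult = QUANT[quant]
    bandwidth = bandwidth_gbs * multi_gpu_scaling
    weights_gb = param_basis_b * gb_per_billion
    efficiency = 0.62 if is_apple_silicon else 0.70
    return max(2.0, bandwidth / weights_gb * efficiency
               * fit_penalty * throughput_mult)


# Dense 8B model at Q4_K_M on a ~1008 GB/s discrete GPU.
print(round(decode_tps(1008, 8, "Q4_K_M"), 1))  # prints 164.2
# MoE with 46.7B total / 12.9B active params.
print(round(moe_param_basis(46.7, 12.9), 2))    # prints 24.33
```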

Prefill throughput (compute-bound)

computeBoundPrefill    = (fp16Tflops * 1000) / paramsActiveB
bandwidthBoundPrefill  = decodeTps * 2.5
prefillTps             = min(computeBoundPrefill, bandwidthBoundPrefill)

Prefill is typically compute-bound (limited by FP16 TFLOPS), but on bandwidth-constrained hardware it can be bandwidth-bound instead. The engine takes the minimum of both estimates.

Time to first token (TTFT)

ttftMs = max(350, (promptTokens / prefillTps) * 220)

A floor of 350 ms accounts for runtime initialization, memory allocation, and first-token scheduling latency that exist regardless of model size.
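A short Python sketch of the prefill and TTFT formulas, using round illustrative inputs. The function names are hypothetical; the constants (1000, 2.5, 220, 350) are the ones given above.

```python
def prefill_tps(fp16_tflops: float, params_active_b: float,
                decode_tps: float) -> float:
    """Take the smaller of the compute-bound and bandwidth-bound estimates."""
    compute_bound = (fp16_tflops * 1000) / params_active_b
    bandwidth_bound = decode_tps * 2.5
    return min(compute_bound, bandwidth_bound)


def ttft_ms(prompt_tokens: int, prefill: float) -> float:
    """Time to first token in ms, with the 350 ms floor."""
    return max(350.0, (prompt_tokens / prefill) * 220)


prefill = prefill_tps(fp16_tflops=80, params_active_b=8, decode_tps=160)
print(prefill)                 # prints 400.0
print(ttft_ms(2000, prefill))  # prints 1100.0
print(ttft_ms(100, prefill))   # prints 350.0 (floor applies)
```

With these inputs the bandwidth-derived estimate (400) is much smaller than the compute-derived one (10000), so it determines the result.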

Safe context window

memorySafeContext = max(4096,
  floor((availableMemory / requiredMemory) * workloadContextTokens)
)
maxSafeContext = min(memorySafeContext, officialContextTokens)

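This caps the recommendation at the model's official context limit and shrinks it in proportion to available memory. Because of the 4096-token floor, the memory-derived value never drops below 4096 tokens, even when requiredMemory exceeds availableMemory.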
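The safe-context formula can be expressed as a small Python function. The name and the example inputs are illustrative only.

```python
import math


def max_safe_context(available_gb: float, required_gb: float,
                     workload_context_tokens: int,
                     official_context_tokens: int) -> int:
    """Scale the workload context by the memory ratio, floor at 4096,
    then cap at the model's official context length."""
    memory_safe = max(
        4096,
        math.floor((available_gb / required_gb) * workload_context_tokens),
    )
    return min(memory_safe, official_context_tokens)


print(max_safe_context(24, 12, 8192, 131072))  # prints 16384
print(max_safe_context(24, 48, 8192, 131072))  # prints 4096
print(max_safe_context(24, 12, 8192, 8192))    # prints 8192
```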

6. Scoring & Grades

The compatibility score answers one question: “what's the best model I can run on my hardware?” It's a 0–100 composite that weights model quality most heavily (45 pts), treats fit as a prerequisite rather than a differentiator (25 pts), and rewards speed (15 pts) and VRAM efficiency (15 pts) without penalizing tight fits. The result maps to a letter grade.

compatibilityScore

One unified function used everywhere: quality is the dominant component, fit is a prerequisite (not a differentiator), and a tight fit costs only 3 points relative to a native fit:

fitBase (0-25) =                         ← prerequisite, not differentiator
  native_fit (no offload)  : 25
  tight_fit  (no offload)  : 22           ← only 3pt gap from native
  offloadRatio <= 0.10     : 12  (light offload)
  offloadRatio <= 0.30     :  6  (moderate offload)
  offloadRatio >  0.30     :  2  (heavy offload)
  no_fit                   :  0

qualityScore    (0-45) = qualityTier / 100 × 45   ← the main differentiator
speedScore      (0-15) = min(15, log₂(decodeTps + 1) × 2.3)
efficiencyScore (0-15) = min(15, (vramUtil / 0.70) × 15)
                         ↑ one-sided ramp: caps at 70% util, no penalty above

score = clamp(0, 100, fitBase + qualityScore + speedScore + efficiencyScore)

Why quality-dominant? The user asks “what's the best model for my GPU?” — not “what's the most comfortable model?” Among models that fit, a tier-95 model beats a tier-50 model by ~20 points — enough for a full grade difference.

Why no tight-fit penalty? Using 90%+ of VRAM means you're running the biggest model that fits — that's a good thing. The one-sided ramp only penalizes under-utilization (wasting GPU capacity on a tiny model).
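The four components can be combined in a few lines of Python. This is a sketch of the documented formula with hypothetical function names; the example reproduces the roughly 20-point gap between a tier-95 and a tier-50 model under otherwise identical conditions.

```python
import math


def fit_base(status: str, offload_ratio: float = 0.0) -> int:
    """0-25 points: fit acts as a prerequisite."""
    if status == "no_fit":
        return 0
    if status == "native_fit":
        return 25
    if status == "tight_fit":
        return 22
    if offload_ratio <= 0.10:
        return 12  # light offload
    if offload_ratio <= 0.30:
        return 6   # moderate offload
    return 2       # heavy offload


def compatibility_score(status: str, quality_tier: float, decode_tps: float,
                        vram_util: float, offload_ratio: float = 0.0) -> float:
    quality = quality_tier / 100 * 45
    speed = min(15, math.log2(decode_tps + 1) * 2.3)
    efficiency = min(15, (vram_util / 0.70) * 15)  # one-sided ramp
    total = fit_base(status, offload_ratio) + quality + speed + efficiency
    return max(0, min(100, total))


high = compatibility_score("native_fit", 95, 63, 0.70)
low = compatibility_score("native_fit", 50, 63, 0.70)
print(round(high, 2), round(low, 2), round(high - low, 2))  # prints 96.55 76.3 20.25
```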

Grade thresholds

Grade  Min score  Label
S      85         Excellent
A      70         Great
B      55         Good
C      40         Usable
D      20         Poor
F      < 20       Won’t run
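Mapping a score to a letter is a descending threshold lookup. A minimal Python sketch, with a hypothetical function name:

```python
# (minimum score, grade), highest first, from the table above.
GRADE_THRESHOLDS = [(85, "S"), (70, "A"), (55, "B"), (40, "C"), (20, "D")]


def grade_for(score: float) -> str:
    """Return the first grade whose minimum score is met, else F."""
    for min_score, letter in GRADE_THRESHOLDS:
        if score >= min_score:
            return letter
    return "F"


print(grade_for(96.55), grade_for(76.3), grade_for(19.9))  # prints S A F
```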

7. Recommendation Engine

The recommendation engine scores every eligible model against your hardware and workload, then returns the top 3 by composite score. The composite is not a simple weighted average — it is an additive sum of purpose-built sub-scores, each tuned to a different signal.

compositeScore =
    qualityTier     * (5.8 + qualityWeight * 3.2)
  + useCaseAdjustment
  + freshnessScore
  + legacyPenalty
  + fitBoost
  + sizeScore       * (0.7 + qualityWeight)
  + architectureScore
  + utilizationScore * (0.55 + qualityWeight * 0.35)
  + speedScore
  + latencyScore
  + contextScore

Quality tier

The model's quality rating multiplied by a weight derived from the workload's quality preference. Here the rating is on a 1–5 scale, which differs from the 0–100 qualityTier used in compatibilityScore. Higher-quality models score higher, especially for demanding workloads.

Fit boost

A direct bonus or penalty based on fit status: native_fit +24, tight_fit +12, hybrid_fit +3, unsafe_fit −12, no_fit −30. This ensures runnable models always rank above non-runnable ones.

Use-case specialization

Models tagged for the current workload (e.g. a code-specialist model for a coding workload) receive a significant boost (+14 to +26). Generalist models receive a penalty (−12 to −18) on specialist workloads.

Parameter fit score

Compares model size against workload-specific ideal parameter ranges that scale with available memory. Too small is penalized; the sweet spot scores highest; oversized models taper off.

Memory utilization

Rewards models that use your VRAM efficiently. The ideal utilization target scales with total available memory (52% for ≤12 GB up to 74% for >192 GB). Distance from the target reduces the score.

Speed score

Derived from decode TPS: min(12, sqrt(decodeTps)) * (0.55 + throughputWeight). Faster inference earns a meaningful but capped bonus.

Latency score

Derived from TTFT: min(6, 900 / max(250, ttftMs)) * (0.35 + latencyWeight * 0.5). Lower latency is rewarded.
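The speed and latency sub-scores above are both capped functions of a measured quantity, scaled by a workload weight. The sketch below writes them out in Python; the function names are hypothetical and the weight values in the example are arbitrary, since the page does not list per-workload weights.

```python
import math


def speed_score(decode_tps: float, throughput_weight: float) -> float:
    """Square-root ramp on decode TPS, capped at 12 before weighting."""
    return min(12, math.sqrt(decode_tps)) * (0.55 + throughput_weight)


def latency_score(ttft_ms: float, latency_weight: float) -> float:
    """Inverse of TTFT (floored at 250 ms), capped at 6 before weighting."""
    return min(6, 900 / max(250, ttft_ms)) * (0.35 + latency_weight * 0.5)


# Arbitrary example weights, chosen only to keep the arithmetic readable.
print(round(speed_score(100, 0.45), 2))   # prints 10.0
print(round(latency_score(450, 0.5), 2))  # prints 1.2
```

One observation from the formula as written: with the 250 ms floor, 900 / max(250, ttftMs) cannot exceed 3.6, so the cap of 6 does not come into play in this sketch.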

Context alignment

Compares the workload's desired context length against the model's safe context window. Full coverage earns +7 to +11; shortfalls incur up to −26.

Freshness

Frontier models get a boost (+4.5 to +7), legacy models get a penalty (−8 or more). Models available on multiple distribution channels (HuggingFace, Ollama, LM Studio) earn small bonuses. MoE architectures and large context windows also earn freshness credit.

Architecture score

MoE/sparse models get a boost on coding, agentic-coding, and reasoning workloads (+5 to +6.5), especially on high-memory hardware (≥64 GB). Dense models rarely receive an architecture bonus unless they are very high quality.

8. Multi-GPU Support

For discrete GPUs, the engine supports multi-GPU configurations. Apple Silicon is always treated as a single device because its unified memory is already shared.

effectiveGpuCount = isAppleSilicon ? 1 : max(1, gpuCount)
multiGpuScaling   = gpuCount > 1 ? gpuCount * 0.85 : 1.0

totalMemory = singleGpuMemory * effectiveGpuCount
bandwidth   = memoryBandwidthGbps * multiGpuScaling
fp16Tflops  = fp16Tflops * multiGpuScaling

The 0.85 scaling factor accounts for inter-GPU communication overhead (tensor parallelism synchronization, PCIe/NVLink transfer latency). Two GPUs yield ~1.7x effective bandwidth and compute, not 2x, while total VRAM is the plain sum across cards, as the formulas above show.
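Following the formulas above, a Python sketch of the multi-GPU adjustment might look like this. The function name and return shape are hypothetical. One labeled assumption: the scaling here is derived from the effective GPU count so that Apple Silicon is never scaled; for discrete GPUs this is identical to using gpuCount.

```python
def multi_gpu_specs(single_gpu_memory_gb: float, memory_bandwidth_gbs: float,
                    fp16_tflops: float, gpu_count: int,
                    is_apple_silicon: bool = False) -> dict:
    """Aggregate memory, bandwidth, and compute across GPUs."""
    effective_count = 1 if is_apple_silicon else max(1, gpu_count)
    # Assumption: scale by the effective count (see lead-in).
    scaling = effective_count * 0.85 if effective_count > 1 else 1.0
    return {
        "total_memory_gb": single_gpu_memory_gb * effective_count,
        "bandwidth_gbs": memory_bandwidth_gbs * scaling,
        "fp16_tflops": fp16_tflops * scaling,
    }


dual = multi_gpu_specs(24, 1000, 80, gpu_count=2)
print(dual["total_memory_gb"])          # prints 48
print(round(dual["bandwidth_gbs"], 1))  # prints 1700.0
print(round(dual["fp16_tflops"], 1))    # prints 136.0
```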

9. Data Sources

The catalog that powers fit and recommendation scores is built from three layers of data:

Curated Catalog

Hand-verified hardware specs (GPUs, Apple Silicon, consumer and datacenter), runtime profiles (llama.cpp, Ollama, vLLM, etc.), model variants with quality tiers, and workload definitions. Stored as structured JSON in the repository and seeded into Postgres.

HuggingFace

Automated ingestion from the HuggingFace Hub API: parameter counts, architecture metadata, quantization availability, and download stats. Updated regularly to capture new model releases.

Ollama

The Ollama model library is ingested to track which models are immediately available via the most popular local inference runner. This feeds into distribution channel scoring.

All catalog data passes through ingestion pipelines that normalize specs, resolve conflicts between sources, and assign quality tiers. The Rust API loads the full catalog into memory at startup for sub-millisecond query response times.

Image & Video Generation

For diffusion models (Flux, SDXL, Stable Diffusion, video generators), we use a different methodology tailored to how these models work.

Memory Model

Diffusion models have multiple components that share VRAM: the denoising backbone (UNet or DiT), VAE decoder, text encoder(s), and temporary activations. Unlike LLMs where KV cache scales with context length, diffusion memory scales with image resolution (quadratically with pixel count) and frame count for video.

  • Backbone weights: scaled by precision (FP16 = 2 GB, FP8 = 1 GB, NF4 = 0.5 GB per billion params)
  • VAE: always FP16 (quality-critical for image decode)
  • Text encoder: FP16 by default, FP8 when the backbone uses lower precision
  • Activations: resolution-dependent, sub-linear with frame count for video
  • Sequential offload: pipeline components can run one at a time (text encoder → backbone → VAE), needing only the largest single component in VRAM
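The component list above can be turned into a rough memory sum. The Python sketch below is illustrative only: the function name is hypothetical, the VAE, text encoder, and activation sizes are passed in because the page gives no formulas for them, and adding activations on top of the largest component under sequential offload is an assumption of this sketch.

```python
# GB per billion backbone params by precision (from the list above).
BACKBONE_GB_PER_BILLION = {"fp16": 2.0, "fp8": 1.0, "nf4": 0.5}


def diffusion_memory_gb(backbone_params_b: float, precision: str,
                        vae_gb: float, text_encoder_gb: float,
                        activations_gb: float,
                        sequential_offload: bool = False) -> float:
    """Estimate VRAM for a diffusion pipeline.

    Without offload all components are resident at once; with sequential
    offload only the largest one is (assumption: activations still add).
    """
    backbone_gb = backbone_params_b * BACKBONE_GB_PER_BILLION[precision]
    components = [backbone_gb, vae_gb, text_encoder_gb]
    resident = max(components) if sequential_offload else sum(components)
    return resident + activations_gb


# Hypothetical 12B backbone at FP8 with made-up component sizes.
print(round(diffusion_memory_gb(12, "fp8", 0.2, 5.0, 2.0), 1))        # prints 19.2
print(round(diffusion_memory_gb(12, "fp8", 0.2, 5.0, 2.0, True), 1))  # prints 14.0
```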

Speed Estimation

Generation speed depends on model size, resolution, inference steps, and GPU compute. We calibrate against real community benchmarks (SD 1.5 ~2.5 s, SDXL ~6 s, Flux ~12 s on RTX 4090 at native resolution). Generation time grows sub-linearly with parameter count because larger models achieve better GPU utilization.

Evidence-First

When real benchmark data exists for a model+hardware+resolution combination, we use the measured values instead of estimates. You'll see a green “Measured” indicator on results backed by real data. Our database includes measurements from official model cards, community benchmarks, and HuggingFace documentation.

Runtimes

We track diffusion runtimes including ComfyUI, Automatic1111/Forge, InvokeAI, and the diffusers Python library. Each has different optimization capabilities (xformers, torch.compile, sequential offloading) that affect real-world performance.

10. What Makes Us Different

Several other tools answer “can I run this model?” — most notably canirun.ai. Here is how WillItRunAI goes further:

Dual engine (Rust + TypeScript)

The core scoring logic runs as a compiled Rust API for production accuracy, with a TypeScript mirror in the browser for instant offline estimates. Both implementations share the same formulas and thresholds.

Workload-aware recommendations

Most tools only check if a model fits. We rank models for your specific workload — coding, agentic coding, reasoning, RAG, or general chat — with different parameter targets, specialization boosts, and quality weights per workload.

Runtime-aware estimation

Different runtimes (llama.cpp, Ollama, vLLM, MLX) have different memory overheads and backend requirements. Our engine accounts for per-runtime overhead and recommends the best runtime for each hardware/model pair.

Multi-GPU support

We model tensor parallelism scaling with realistic efficiency factors, rather than simply multiplying VRAM. Inter-GPU communication overhead is factored into bandwidth and throughput estimates.

Offload modeling

Instead of a binary fits/doesn’t-fit answer, we model partial offload to system RAM with interconnect-specific penalties (PCIe 3.0 vs 4.0 vs Thunderbolt), host RAM validation, and graduated fit statuses.

Performance estimation

We estimate decode TPS, prefill TPS, TTFT, and safe context window — not just whether it fits. This lets you compare how well different models will actually perform on your hardware.

Transparent scoring

Every formula, weight, and threshold is documented on this page. Hover any score to see the exact 4-component breakdown (fit, quality, speed, efficiency). No opaque "AI scores" — you can reproduce any result with a calculator.

Curated quality tiers

Models are assigned quality tiers based on benchmark performance, community consensus, and architecture generation. Quality is the single largest factor in our compatibility score (45 of 100 points), ensuring we recommend the best model that fits — not just the most comfortable one.