Detect
Identify your GPU, VRAM, memory bandwidth, and compute capability via WebGPU/WebGL or manual selection.
Methodology
Every estimate on this site is produced by a deterministic engine — no “vibes,” no hidden LLM calls, no opaque scores. This page documents the actual formulas, thresholds, and weights so you can verify or challenge any result.
1
The platform follows a four-stage pipeline that runs entirely in your browser or on our API — never on a third-party LLM.
Detect
Identify your GPU, VRAM, memory bandwidth, and compute capability via WebGPU/WebGL or manual selection.
Estimate Fit
Calculate whether each model's weights, KV cache, and runtime overhead fit your available memory.
Score
Combine model quality, fit status, inference speed, and VRAM efficiency into a 0–100 compatibility score, then map to an S/A/B/C/D/F grade.
Recommend
Rank all eligible models by a weighted composite of quality, fit, utilization, speed, specialization, and freshness.
2
When you visit the calculator, the site attempts to auto-detect your hardware through the browser's GPU APIs. Detection proceeds through three tiers of decreasing confidence:
WebGPU
The preferred path. Returns the exact adapter name and VRAM limits. Available in Chrome 113+, Edge 113+, and recent Firefox Nightly builds.
WebGL
Fallback when WebGPU is unavailable. Reads the WEBGL_debug_renderer_info extension for the renderer string. VRAM is not exposed, so it is inferred from the catalog.
Manual Override
You can always select your hardware from our curated catalog. This is the only option on Safari and older browsers, and it guarantees exact specs.
Apple Silicon
Macs with Apple Silicon use unified memory shared between CPU and GPU. The engine applies a 0.72 utilization factor to the total unified memory (e.g. a 36 GB M4 Pro yields ~25.9 GB usable for inference) and uses a 0.15 headroom ratio instead of the 0.10 used for discrete GPUs. Offload severity is also much lower (0.3) because the “offload” stays within the same memory fabric.
3
The total memory required by a model is the sum of four components:
requiredMemory = weightsMemory + runtimeOverhead + kvCache + headroom
weightsMemory = paramsTotalB * gbPerBillion(quant)
kvCache = paramsB * 0.030 * (contextTokens / 4096) // fallback
≈ paramsB * 0.06 at 8192 context
headroom = totalAvailableMemory * headroomRatio
(Apple Silicon: 0.15 | Discrete GPU: 0.10)When architecture metadata is available (numLayers, numKvHeads, headDim), the engine uses an exact per-layer KV cache calculation instead of the parameter-scaled approximation shown above.
The gbPerBillion constant comes from the quantization profile. These values include GGUF metadata and structural overhead, not just raw weight bytes:
| Quant | Bits | GB / Billion params | Quality | Throughput multiplier |
|---|---|---|---|---|
| Q2_K | 2 | 0.34 | Low | 1.16 |
| Q3_K_S | 3 | 0.46 | Low | 1.12 |
| Q4_K_M | 4 | 0.58 | Medium | 1.08 |
| Q5_K_M | 5 | 0.68 | High | 1.01 |
| Q6_K | 6 | 0.78 | High | 0.97 |
| Q8_0 | 8 | 1.02 | Very High | 0.90 |
| F16 | 16 | 2.00 | Maximum | 0.76 |
Example: Llama 3.1 70B at Q4_K_M needs roughly 70 * 0.58 = 40.6 GB just for weights. Add runtime overhead (~1 GB), KV cache, and headroom to get the total required memory.
4
The engine computes a ratio = requiredMemory / availableMemory and maps it to one of five fit statuses:
| Status | Ratio threshold | Label | Meaning |
|---|---|---|---|
| native_fit | ≤ 0.82 | Runs well | Comfortable headroom — full context, no offload. |
| tight_fit | ≤ 0.95 | Tight fit | Fits, but headroom is limited. Reducing context may help. |
| hybrid_fit | ≤ 1.08 | Runs with offload | Exceeds VRAM slightly. Some layers offloaded to system RAM. |
| unsafe_fit | ≤ 1.20 | Very compromised | Significant offload required. Performance will degrade substantially. |
| no_fit | > 1.20 | Too heavy | Model does not fit this hardware even with offload. |
Offload validation
When the fit status is hybrid_fit or unsafe_fit, the engine checks whether the host system has enough RAM to absorb the offloaded layers. It reserves 40% of system RAM for the OS and compares the remainder against the offloaded weight portion. If host RAM is insufficient, the status is downgraded (hybrid → unsafe, unsafe → no_fit).
Offload severity by interconnect
The performance penalty of offloading depends on how fast data moves between GPU and host memory. The engine assigns an offload severity factor per interconnect type:
Apple Silicon (unified memory): 0.30 PCIe 4.0: 0.85 PCIe 3.0: 0.92 Thunderbolt / USB: 0.95 Default (unknown): 0.85 fitPenalty = max(0.15, 1.0 - offloadFraction * offloadSeverity)
5
The engine estimates two key throughput metrics: decode TPS (tokens per second during generation) and prefill TPS (tokens per second during prompt ingestion).
Decode throughput (memory-bandwidth-bound)
decodeTps = max(2, (bandwidth / throughputWeightsMemory) * efficiency * fitPenalty * quant.throughputMultiplier ) bandwidth = memoryBandwidthGbps * multiGpuScaling throughputWeightsMemory = throughputParamBasis * gbPerBillion(quant) efficiency = Apple Silicon: 0.62 | Discrete GPU: 0.70
For Mixture-of-Experts models, throughputParamBasis is not simply total params. The engine uses activeParams * 1.1 + (totalParams - activeParams) * 0.3 to account for expert routing overhead while discounting inactive parameters that are not touched per token.
Prefill throughput (compute-bound)
computeBoundPrefill = (fp16Tflops * 1000) / paramsActiveB bandwidthBoundPrefill = decodeTps * 2.5 prefillTps = min(computeBoundPrefill, bandwidthBoundPrefill)
Prefill is typically compute-bound (limited by FP16 TFLOPS), but on bandwidth-constrained hardware it can be bandwidth-bound instead. The engine takes the minimum of both estimates.
Time to first token (TTFT)
ttftMs = max(350, (promptTokens / prefillTps) * 220)
A floor of 350 ms accounts for runtime initialization, memory allocation, and first-token scheduling latency that exist regardless of model size.
Safe context window
memorySafeContext = max(4096, floor((availableMemory / requiredMemory) * workloadContextTokens) ) maxSafeContext = min(memorySafeContext, officialContextTokens)
This ensures we never recommend a context length that would exceed the model's official limit or cause out-of-memory errors.
6
The compatibility score answers one question: “what's the best model I can run on my hardware?” It's a 0–100 composite that weights model quality most heavily (45 pts), treats fit as a prerequisite rather than a differentiator (25 pts), and rewards speed (15 pts) and VRAM efficiency (15 pts) without penalizing tight fits. The result maps to a letter grade.
compatibilityScore
One unified function used everywhere — quality is the dominant component, fit is a prerequisite (not a differentiator), and tight fits are never penalized:
fitBase (0-25) = ← prerequisite, not differentiator
native_fit (no offload) : 25
tight_fit (no offload) : 22 ← only 3pt gap from native
offloadRatio <= 0.10 : 12 (light offload)
offloadRatio <= 0.30 : 6 (moderate offload)
offloadRatio > 0.30 : 2 (heavy offload)
no_fit : 0
qualityScore (0-45) = qualityTier / 100 × 45 ← the main differentiator
speedScore (0-15) = min(15, log₂(decodeTps + 1) × 2.3)
efficiencyScore (0-15) = min(15, (vramUtil / 0.70) × 15)
↑ one-sided ramp: caps at 70% util, no penalty above
score = clamp(0, 100, fitBase + qualityScore + speedScore + efficiencyScore)Why quality-dominant? The user asks “what's the best model for my GPU?” — not “what's the most comfortable model?” Among models that fit, a tier-95 model beats a tier-50 model by ~20 points — enough for a full grade difference.
Why no tight-fit penalty? Using 90%+ of VRAM means you're running the biggest model that fits — that's a good thing. The one-sided ramp only penalizes under-utilization (wasting GPU capacity on a tiny model).
Grade thresholds
| Grade | Min score | Label |
|---|---|---|
| S | 85 | Excellent |
| A | 70 | Great |
| B | 55 | Good |
| C | 40 | Usable |
| D | 20 | Poor |
| F | < 20 | Won’t run |
7
The recommendation engine scores every eligible model against your hardware and workload, then returns the top 3 by composite score. The composite is not a simple weighted average — it is an additive sum of purpose-built sub-scores, each tuned to a different signal.
compositeScore =
qualityTier * (5.8 + qualityWeight * 3.2)
+ useCaseAdjustment
+ freshnessScore
+ legacyPenalty
+ fitBoost
+ sizeScore * (0.7 + qualityWeight)
+ architectureScore
+ utilizationScore * (0.55 + qualityWeight * 0.35)
+ speedScore
+ latencyScore
+ contextScoreQuality tier
The model's quality rating (1–5 scale) multiplied by a weight derived from the workload's quality preference. Higher-quality models score higher, especially for demanding workloads.
Fit boost
A direct bonus or penalty based on fit status: native_fit +24, tight_fit +12, hybrid_fit +3, unsafe_fit −12, no_fit −30. This ensures runnable models always rank above non-runnable ones.
Use-case specialization
Models tagged for the current workload (e.g. a code-specialist model for a coding workload) receive a significant boost (+14 to +26). Generalist models receive a penalty (−12 to −18) on specialist workloads.
Parameter fit score
Compares model size against workload-specific ideal parameter ranges that scale with available memory. Too small is penalized; the sweet spot scores highest; oversized models taper off.
Memory utilization
Rewards models that use your VRAM efficiently. The ideal utilization target scales with total available memory (52% for ≤12 GB up to 74% for >192 GB). Distance from the target reduces the score.
Speed score
Derived from decode TPS: min(12, sqrt(decodeTps)) * (0.55 + throughputWeight). Faster inference earns a meaningful but capped bonus.
Latency score
Derived from TTFT: min(6, 900 / max(250, ttftMs)) * (0.35 + latencyWeight * 0.5). Lower latency is rewarded.
Context alignment
Compares the workload's desired context length against the model's safe context window. Full coverage earns +7 to +11; shortfalls incur up to −26.
Freshness
Frontier models get a boost (+4.5 to +7), legacy models get a penalty (−8 or more). Models available on multiple distribution channels (HuggingFace, Ollama, LM Studio) earn small bonuses. MoE architectures and large context windows also earn freshness credit.
Architecture score
MoE/sparse models get a boost on coding, agentic-coding, and reasoning workloads (+5 to +6.5), especially on high-memory hardware (≥64 GB). Dense models rarely receive an architecture bonus unless they are very high quality.
8
For discrete GPUs, the engine supports multi-GPU configurations. Apple Silicon is always treated as a single device because its unified memory is already shared.
effectiveGpuCount = isAppleSilicon ? 1 : max(1, gpuCount) multiGpuScaling = gpuCount > 1 ? gpuCount * 0.85 : 1.0 totalMemory = singleGpuMemory * effectiveGpuCount bandwidth = memoryBandwidthGbps * multiGpuScaling fp16Tflops = fp16Tflops * multiGpuScaling
The 0.85 scaling factor accounts for inter-GPU communication overhead (tensor parallelism synchronization, PCIe/NVLink transfer latency). Two GPUs yield ~1.7x effective bandwidth and VRAM, not 2x.
9
The catalog that powers fit and recommendation scores is built from three layers of data:
Curated Catalog
Hand-verified hardware specs (GPUs, Apple Silicon, consumer and datacenter), runtime profiles (llama.cpp, Ollama, vLLM, etc.), model variants with quality tiers, and workload definitions. Stored as structured JSON in the repository and seeded into Postgres.
HuggingFace
Automated ingestion from the HuggingFace Hub API: parameter counts, architecture metadata, quantization availability, and download stats. Updated regularly to capture new model releases.
Ollama
The Ollama model library is ingested to track which models are immediately available via the most popular local inference runner. This feeds into distribution channel scoring.
All catalog data passes through ingestion pipelines that normalize specs, resolve conflicts between sources, and assign quality tiers. The Rust API loads the full catalog into memory at startup for sub-millisecond query response times.
For diffusion models (Flux, SDXL, Stable Diffusion, video generators), we use a different methodology tailored to how these models work.
Diffusion models have multiple components that share VRAM: the denoising backbone (UNet or DiT), VAE decoder, text encoder(s), and temporary activations. Unlike LLMs where KV cache scales with context length, diffusion memory scales with image resolution (quadratically with pixel count) and frame count for video.
Generation speed depends on model size, resolution, inference steps, and GPU compute. We calibrate against real community benchmarks (SD 1.5 ~2.5s, SDXL ~6s, Flux ~12s on RTX 4090 at native resolution). Speed scales sub-linearly with parameter count because larger models achieve better GPU utilization.
When real benchmark data exists for a model+hardware+resolution combination, we use the measured values instead of estimates. You'll see a green “Measured” indicator on results backed by real data. Our database includes measurements from official model cards, community benchmarks, and HuggingFace documentation.
We track diffusion runtimes including ComfyUI, Automatic1111/Forge, InvokeAI, and the diffusers Python library. Each has different optimization capabilities (xformers, torch.compile, sequential offloading) that affect real-world performance.
10
Several other tools answer “can I run this model?” — most notably canirun.ai. Here is how WillItRunAI goes further:
Dual engine (Rust + TypeScript)
The core scoring logic runs as a compiled Rust API for production accuracy, with a TypeScript mirror in the browser for instant offline estimates. Both implementations share the same formulas and thresholds.
Workload-aware recommendations
Most tools only check if a model fits. We rank models for your specific workload — coding, agentic coding, reasoning, RAG, or general chat — with different parameter targets, specialization boosts, and quality weights per workload.
Runtime-aware estimation
Different runtimes (llama.cpp, Ollama, vLLM, MLX) have different memory overheads and backend requirements. Our engine accounts for per-runtime overhead and recommends the best runtime for each hardware/model pair.
Multi-GPU support
We model tensor parallelism scaling with realistic efficiency factors, rather than simply multiplying VRAM. Inter-GPU communication overhead is factored into bandwidth and throughput estimates.
Offload modeling
Instead of a binary fits/doesn’t-fit answer, we model partial offload to system RAM with interconnect-specific penalties (PCIe 3.0 vs 4.0 vs Thunderbolt), host RAM validation, and graduated fit statuses.
Performance estimation
We estimate decode TPS, prefill TPS, TTFT, and safe context window — not just whether it fits. This lets you compare how well different models will actually perform on your hardware.
Transparent scoring
Every formula, weight, and threshold is documented on this page. Hover any score to see the exact 4-component breakdown (fit, quality, speed, efficiency). No opaque "AI scores" — you can reproduce any result with a calculator.
Curated quality tiers
Models are assigned quality tiers based on benchmark performance, community consensus, and architecture generation. Quality is the single largest factor in our compatibility score (45 of 100 points), ensuring we recommend the best model that fits — not just the most comfortable one.