InternLM 20B

Name: InternLM 20B
Rating: 56 (64 reviews)
Author: InternLM

Status: Legacy

11.0K downloads · 95 likes · Released Jul 2024 · 8K tokens context · InternLM license · Quality: 22 (Entry)

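InternLM 20B (20B parameters) needs approximately 34.2 GB of VRAM in total with Q5_K_M quantization: 14.4 GB for the quantized weights plus the KV cache, runtime, and headroom estimated in the memory breakdown. For the best balance of quality and speed, we recommend hardware with at least 40 GB of VRAM.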

Get started

Copy-paste commands to run InternLM 20B on your machine.

Run

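# Assumes GGUF weights in internlm/internlm2_5-20b-chat-gguf; verify the exact file name there.
# --run selects llama-cli in the :full images; --gpus all (NVIDIA Container Toolkit) enables -ngl offload.
docker run --rm -it --gpus all ghcr.io/ggml-org/llama.cpp:full-cuda --run \
  --hf-repo "internlm/internlm2_5-20b-chat-gguf" \
  --hf-file "internlm2_5-20b-chat-q5_k_m.gguf" \
  -c 4096 -ngl 99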

Quick specs

Parameters: 20B

Architecture: dense

Context: 8K tokens

Modality: text

Min RAM: 7.8 GB (Q2_K weights only)

Rec. RAM: 14.4 GB (Q5_K_M weights only)

License: InternLM

Family: InternLM

Capabilities: ✓ Code ✓ Chat

About this model

InternLM2.5 open-sources a 20-billion-parameter base model and a chat model tailored for practical scenarios. The model has the following characteristics:

• Outstanding reasoning capability: state-of-the-art performance on math reasoning, surpassing models such as Llama3 and Gemma2-27B.
• Stronger tool use: InternLM2.5 supports gathering information from more than 100 web pages; the corresponding implementation has been released in...

Inference speed

InternLM 20B inference speed — tokens per second by GPU & Mac

Estimated decode speed (tokens/sec) for InternLM 20B at Q5_K_M across popular GPUs and Apple Silicon, using the fastest local runtime per device. The fastest device listed here is the RTX 5090 32GB at ~45 tok/s. Decoding is memory-bandwidth bound: each generated token reads the full set of weights, so cards that hold the whole model in VRAM run far faster than ones that offload part of it to system RAM.

GPU / Mac	Memory	Quant	Speed (tok/s)	Fits?
RTX 5090 32GB	32 GB	Q5_K_M	44.5	Heavy offload
Mac Studio M3 Ultra 256GB	256 GB	Q5_K_M	39.4	Fits
Mac Studio M2 Ultra 128GB	128 GB	Q5_K_M	32.9	Fits
Mac Studio M1 Ultra 128GB	128 GB	Q5_K_M	31.2	Fits
MacBook Pro M4 Max 128GB	128 GB	Q5_K_M	30.7	Fits
MacBook Pro M4 Max 64GB	64 GB	Q5_K_M	30.7	Tight
MacBook Pro M3 Max 64GB	64 GB	Q5_K_M	17.0	Tight
MacBook Pro M4 Pro 48GB	48 GB	Q5_K_M	16.1	Heavy offload
RX 7900 XTX 24GB	24 GB	Q5_K_M	15.6	Too big
MacBook Pro M1 Max 64GB	64 GB	Q5_K_M	15.6	Tight
RTX 4090 24GB	24 GB	Q5_K_M	14.0	Too big
RTX 3090 24GB	24 GB	Q5_K_M	12.9	Too big
RTX 4080 Super 16GB	16 GB	Q5_K_M	5.3	Too big
RTX 4070 12GB	12 GB	Q5_K_M	3.3	Too big
RTX 3060 12GB	12 GB	Q5_K_M	2.2	Too big
RTX 4060 8GB	8 GB	Q5_K_M	2.0	Too big

Estimates for single-stream decoding at Q5_K_M; real tokens/sec varies with prompt length, context, batch size, and runtime build. Prompt processing (prefill) is faster than the decode figures shown here.
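Because decoding is bandwidth bound, a quick ceiling for any device is its memory bandwidth divided by the bytes of weights streamed per token. The sketch below assumes a dense model that reads all 14.4 GB of Q5_K_M weights once per token; the bandwidth figures are approximate public spec-sheet numbers, treated here as assumptions rather than measurements. Real throughput lands below this ceiling, and far below it when layers are offloaded.

```python
# Rough upper bound on single-stream decode speed for a dense model.
# Assumption: every generated token streams all quantized weights from memory once,
# so tok/s cannot exceed memory bandwidth divided by weight size. KV-cache reads,
# kernel overheads, and offloading all push real numbers below this bound.

WEIGHTS_GB_Q5_K_M = 14.4  # Q5_K_M weight size from the quantization table


def decode_upper_bound(bandwidth_gb_s: float, weights_gb: float = WEIGHTS_GB_Q5_K_M) -> float:
    """Return the bandwidth-limited ceiling on decode tokens/sec."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Approximate peak memory bandwidths in GB/s (assumed, from public spec sheets).
DEVICES = {
    "RTX 5090 32GB": 1792.0,
    "RTX 4090 24GB": 1008.0,
    "Mac Studio M3 Ultra": 819.0,
}

if __name__ == "__main__":
    for name, bw in DEVICES.items():
        print(f"{name}: <= {decode_upper_bound(bw):.1f} tok/s if all weights sit in fast memory")
```

For example, the RTX 4090's ceiling works out to about 70 tok/s, well above the 14.0 tok/s in the table, which marks that card as too small to hold the model.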

Quick picks

Best budget (grade C): Mac mini M4 64GB, ~$1,099, about 8 tok/s

Best overall (grade B): RTX PRO 5000 Blackwell 48GB, ~$4,999, about 80 tok/s

Best hardware

Top picks for InternLM 20B

RTX PRO 5000 Blackwell 48GB (grade B): 48 GB

RTX 6000 Ada 48GB (grade B): 48 GB

AMD Instinct MI210 64GB (grade B): 64 GB

Quantization

InternLM 20B quantization — VRAM & quality by quant level

How much VRAM InternLM 20B (20B) needs at each quant level, and whether it fits a 24 GB card (RTX 4090 / 3090). The recommended Q5_K_M uses ~14.4 GB of weights, about 33% less VRAM than Q8_0, at a small quality cost.

Quant	Bits	VRAM (weights)	Quality	Fits 24 GB?
Q2_K	2	7.8 GB	Low	Too big
Q3_K_S	3	9.8 GB	Low	Too big
NVFP4	4	11.2 GB	Medium	Too big
Q4_K_M	4	12.2 GB	Medium	Too big
Q5_K_M (recommended)	5	14.4 GB	High	Too big
Q6_K	6	16.4 GB	High	Too big
Q8_0	8	21.4 GB	Very High	Too big
F16	16	41 GB	Maximum	Too big

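VRAM shown is quantized weights only; add ~1–3 GB of runtime overhead plus the KV cache for your context length. With the memory breakdown's 18.3 GB KV-cache estimate added, even Q2_K (7.8 GB) exceeds 24 GB. Lower quants trade quality for memory: Q4_K_M is the usual sweet spot, and Q2_K or Q3_K_S are worth using only when nothing larger fits.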
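The table's figures can be checked with simple arithmetic. This sketch copies the weight sizes from the table, treats GB as 10^9 bytes and the parameter count as a flat 20B (both assumptions), and computes the VRAM saving of one quant over another plus the effective bits per weight implied by each size. The effective figure runs above the nominal Bits column because K-quants store per-block scales and the _M variants keep some tensors at higher precision.

```python
# Sanity-check the quantization table: VRAM savings and effective bits per weight.
PARAMS_BILLION = 20.0  # assumed flat 20B; the true parameter count differs somewhat

# Weight sizes in GB (assumed 1e9 bytes), copied from the quantization table.
QUANT_WEIGHTS_GB = {
    "Q2_K": 7.8, "Q3_K_S": 9.8, "NVFP4": 11.2, "Q4_K_M": 12.2,
    "Q5_K_M": 14.4, "Q6_K": 16.4, "Q8_0": 21.4, "F16": 41.0,
}


def savings_vs(smaller: str, larger: str) -> float:
    """Fraction of weight memory saved by choosing `smaller` instead of `larger`."""
    return 1.0 - QUANT_WEIGHTS_GB[smaller] / QUANT_WEIGHTS_GB[larger]


def effective_bits_per_weight(quant: str, params_billion: float = PARAMS_BILLION) -> float:
    """Bits per weight implied by the size: GB * 8 bits / billions of parameters."""
    return QUANT_WEIGHTS_GB[quant] * 8.0 / params_billion


if __name__ == "__main__":
    print(f"Q5_K_M vs Q8_0 saving: {savings_vs('Q5_K_M', 'Q8_0'):.1%}")
    for name in QUANT_WEIGHTS_GB:
        print(f"{name}: {effective_bits_per_weight(name):.2f} effective bits/weight")
```

For example, `savings_vs("Q5_K_M", "Q8_0")` gives about 0.327, matching the ~33% claim, and Q5_K_M works out to roughly 5.8 effective bits per weight rather than 5.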

Quality benchmarks

InternLM 20B benchmark scores

Benchmark verified

Reasoning

MMLU-Pro: 33.3%

GPQA Diamond: 9.5%

MATH-500: 40.8%

ARC Challenge: not reported

General

Chatbot Arena: not reported

IFEval: 70.1%

Source: community · 2025-01-01

Hardware compatibility

Memory breakdown

Reference: RTX 2060 6GB

Weights: 14.4 GB

KV cache: 18.3 GB

Runtime: 0.9 GB

Headroom: 0.6 GB

Total: 34.2 GB
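The four components above sum to the 34.2 GB headline figure. This sketch is a hypothetical helper, not the calculator this page uses: it totals a breakdown, checks it against a device's memory, and adds the standard KV-cache size formula (keys plus values, per layer, per KV head, per position) so you can re-estimate the cache for a different context length. The layer count, KV-head count, and head dimension in the example are placeholders; read the real values from the model's config.json.

```python
from dataclasses import dataclass


@dataclass
class MemoryBreakdown:
    """Hypothetical helper: memory components in GB, as listed in the breakdown."""
    weights_gb: float
    kv_cache_gb: float
    runtime_gb: float
    headroom_gb: float

    def total_gb(self) -> float:
        return self.weights_gb + self.kv_cache_gb + self.runtime_gb + self.headroom_gb

    def fits(self, device_gb: float) -> bool:
        """True if the whole footprint fits in `device_gb` of VRAM or unified memory."""
        return self.total_gb() <= device_gb


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x KV heads x head_dim x positions x bytes.

    bytes_per_elem=2 corresponds to an f16 cache; quantized caches use less.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Figures from the memory breakdown (Q5_K_M).
PAGE_ESTIMATE = MemoryBreakdown(weights_gb=14.4, kv_cache_gb=18.3, runtime_gb=0.9, headroom_gb=0.6)

if __name__ == "__main__":
    print(f"total: {PAGE_ESTIMATE.total_gb():.1f} GB")
    for device, mem_gb in [("RTX 5090 32GB", 32), ("RTX PRO 5000 Blackwell 48GB", 48)]:
        print(device, "fits" if PAGE_ESTIMATE.fits(mem_gb) else "needs offload")
    # Placeholder architecture values, NOT taken from InternLM 20B's config.
    for ctx in (4096, 8192):
        gib = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=ctx) / 2**30
        print(f"KV cache at {ctx} tokens (placeholder config): {gib:.2f} GiB")
```

Because the cache grows linearly with context length, running with `-c 4096` as in the docker command needs half the KV cache of a full 8K window.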

Frequently asked questions

FAQ — InternLM 20B

How much VRAM does InternLM 20B need?

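InternLM 20B (20B parameters) needs approximately 34.2 GB of VRAM in total with Q5_K_M quantization, of which 14.4 GB is the quantized weights and the rest is KV cache, runtime overhead, and headroom. Lower quantizations such as Q4_K_M (12.2 GB of weights) use less memory but may reduce quality.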

Can I run InternLM 20B on a Mac mini M4 64GB?

Yes. The Mac mini M4 64GB can run InternLM 20B, with a compatibility score of 54/100. It provides 64 GB of unified memory and achieves approximately 8.0 tokens per second.

What is the best quantization for InternLM 20B?

The recommended quantization for InternLM 20B is Q5_K_M, which offers the best balance between model quality and memory efficiency. Higher-bit quantizations such as Q6_K and Q8_0 preserve more quality but require more VRAM.

What hardware is recommended for InternLM 20B?

The top recommended hardware for InternLM 20B is the RTX PRO 5000 Blackwell 48GB (score: 64/100), the RTX 6000 Ada 48GB (score: 63/100), and the AMD Instinct MI210 64GB (score: 63/100). These provide the best combination of memory, bandwidth, and compute for running this model locally.

Is InternLM 20B good for chat?

Yes, InternLM 20B is well-suited for chat as well as coding. It was designed with these use cases in mind.