Qwen 3.6 27B VRAM & Hardware Requirements — Dense 27B GPU Guide (2026)
Qwen 3.6 27B: Q4_K_M ~16.8 GB fits RTX 4080 16GB. Flagship coding (77.2% SWE-bench) on a consumer GPU — GPU/Mac buyer guide and GGUF picks.
Qwen released Qwen3.6-27B on April 22, 2026 — a dense 27B model with vision capability that beats the previous-gen flagship Qwen3.5-397B-A17B MoE on SWE-bench Verified (77.2% vs 76.2%) while needing only a fraction of the hardware. If you have a 16-24GB GPU, this is the single most important open-weight release of 2026 Q2.
This page is the canonical reference for Qwen 3.6 27B VRAM requirements and Qwen 3.6 27B hardware requirements — exact GGUF quantization sizes, which GPU or Mac to buy at each tier, and how the dense 27B compares to the sibling 35B-A3B MoE. If you searched qwen 3.6 27b vram requirements or qwen 3.6 27b hardware requirements, you're in the right place.
Also in the Qwen 3.6 family: Qwen 3.6 35B-A3B MoE → — needs 24 GB VRAM, faster tok/s via MoE sparsity. For the original Qwen 3 and Qwen 3.5 families, see Qwen 3 / 3.5 GPU Requirements →.
Quick answers
- Q4_K_M VRAM: ~16.8 GB — fits RTX 4080 16GB tight, RTX 4090 24GB comfortable
- Q5_K_M VRAM: ~19.5 GB — RTX 4090, RTX 5090
- Q6_K VRAM: ~22.5 GB — RTX 4090 24GB, RTX 5090 32GB, Mac M4 Max 36GB+
- Q8_0 VRAM: ~28.6 GB — RTX 5090 32GB, Mac M4 Max 36GB
- Full BF16: 55.6 GB on disk — needs H100 80GB or dual 4090
- Context: 262K native, extensible to 1,010,000 tokens via YaRN
- Architecture: Dense (no MoE), with Gated DeltaNet + Gated Attention hybrid
- Release: April 22, 2026 on Hugging Face + ModelScope (Apache-2.0)
Exact GGUF quantization sizes
From the official unsloth/Qwen3.6-27B-GGUF repository:
| Quant | Size | Recommended for |
|---|---|---|
| UD-IQ2_M | 10.8 GB | Tight 12 GB GPUs (quality compromise) |
| Q3_K_M | 13.6 GB | 16 GB GPUs with long context |
| IQ4_XS | 15.4 GB | 16 GB GPUs balanced |
| Q4_0 | 15.8 GB | 16 GB GPUs baseline |
| Q4_K_S | 15.9 GB | 16 GB GPUs quality |
| IQ4_NL | 16.1 GB | 16 GB GPUs alternative |
| Q4_K_M | 16.8 GB | Default pick — best Q/size |
| UD-Q4_K_XL | ~17.0 GB | Unsloth dynamic — recommended |
| Q4_1 | 17.3 GB | Older format, skip |
| Q5_K_S | 19.0 GB | 24 GB GPUs |
| Q5_K_M | 19.5 GB | 24 GB GPUs |
| Q6_K | 22.5 GB | 24 GB GPUs, precise coding |
| Q8_0 | 28.6 GB | 32 GB GPUs, near-lossless |
| BF16 | 55.6 GB | Multi-GPU / H100 80GB |
Add 1-3 GB for KV cache at default context. At full 1M-token context via YaRN, KV cache can consume an additional 20-40 GB.
Can my GPU run Qwen3.6-27B?
| GPU | VRAM | Qwen3.6-27B fit | Best quant |
|---|---|---|---|
| RTX 4060 Ti 8GB | 8 GB | ❌ Does not fit | — |
| RTX 3060 12GB | 12 GB | ⚠️ Q3_K_M tight, quality loss | Q3_K_M |
| RTX 4060 Ti 16GB / RTX 4070 Ti 16GB | 16 GB | ✅ Q4_K_M fits ~0 headroom | Q4_K_M |
| RTX 4080 16GB / RTX 4080 Super 16GB | 16 GB | ✅ Q4_K_M tight | Q4_K_M |
| RTX 5070 Ti 16GB | 16 GB | ✅ Q4_K_M tight | Q4_K_M |
| RX 7900 XTX 24GB | 24 GB | ✅ Q5_K_M comfortable | Q5_K_M or Q6_K |
| RTX 3090 24GB | 24 GB | ✅ Q6_K comfortable | Q6_K |
| RTX 4090 24GB | 24 GB | ✅ Q6_K ideal | Q6_K |
| RTX 5090 32GB | 32 GB | ✅ Q8_0 + long context | Q8_0 |
| RTX 6000 Ada 48GB | 48 GB | ✅ Q8_0 or BF16 partial | Q8_0 / BF16 |
| H100 80GB | 80 GB | ✅ BF16 + 1M context | BF16 |
| Mac M4 Pro 24GB | 24 GB unified | ✅ Q5_K_M comfortable | Q5_K_M |
| Mac M4 Max 36GB | 36 GB unified | ✅ Q6_K or Q8_0 | Q6_K / Q8_0 |
| Mac M4 Max 64GB | 64 GB unified | ✅ Q8_0 + long context | Q8_0 |
| Mac Studio M3 Ultra 96GB | 96 GB unified | ✅ BF16 | BF16 |
Why Qwen3.6-27B is a big deal
Qwen's central claim: a 27B dense model matches or beats the previous-gen open-weight flagship, which was 397B total parameters (17B active) MoE. The dense architecture has concrete advantages for local inference:
- No MoE routing: simpler inference stacks (works out of the box in llama.cpp, vLLM text mode, LM Studio).
- Predictable latency: no expert-selection variance.
- Fits on consumer GPUs: 16.8 GB at Q4 means RTX 4080 / 4090 / 3090 all serve it directly.
- Multimodal out of the box: vision encoder included (images, OCR, hour-scale video).
Confirmed benchmarks (official model card)
Coding agents:
| Benchmark | Qwen3.6-27B | Qwen3.5-397B-A17B |
|---|---|---|
| SWE-bench Verified | 77.2% | 76.2% |
| SWE-bench Pro | 53.5% | — |
| SWE-bench Multilingual | 71.3% | — |
| Terminal-Bench 2.0 | 59.3% | 52.5% |
| SkillsBench Avg5 | 48.2% | 30.0% |
| QwenWebBench | 1487 | — |
| NL2Repo | 36.2% | — |
| LiveCodeBench v6 | 83.9% | — |
Knowledge + reasoning:
| Benchmark | Score |
|---|---|
| MMLU-Pro | 86.2% |
| C-Eval | 91.4% |
| MMLU-Redux | 93.5% |
| GPQA Diamond | 87.8% |
| AIME 2026 (I & II) | 94.1% |
| HMMT Feb 2026 | 84.3% |
Vision-language:
| Benchmark | Score |
|---|---|
| MMMU | 82.9% |
| VideoMME (w/ sub.) | 87.7% |
| AndroidWorld | 70.3% |
| RefCOCO avg | 92.5% |
Expected performance on common hardware
Community-reported numbers (will be updated as more benchmarks land):
| Hardware | Qwen3.6-27B Q4_K_M | Notes |
|---|---|---|
| RTX 4080 16GB | ~40 tok/s | Q4 barely fits; short context |
| RTX 4090 24GB | ~55-60 tok/s | Q4-Q6 comfortable; sweet spot |
| RTX 4090D 48GB | ~30 tok/s (Q6_K_XL at 262K ctx) | Community report at full context |
| RTX 5090 32GB | ~75-85 tok/s | Q6 ideal, Q8 tight |
| H100 80GB | ~130 tok/s | BF16 serving |
| Mac M4 Pro 24GB | ~22 tok/s | Q5_K_M |
| Mac M4 Max 36GB | ~28-32 tok/s | Q6_K |
| Mac M4 Max 64GB | ~32-38 tok/s | Q8_0 with long context |
Qwen3.6-27B dense vs Qwen3.6-35B-A3B MoE
Same family, different tradeoffs:
| Aspect | Qwen3.6-27B dense | Qwen3.6-35B-A3B MoE |
|---|---|---|
| Total params | 27B | 35B |
| Active per token | 27B (dense) | 3B |
| VRAM Q4_K_M | 16.8 GB | ~21 GB |
| Coding (SWE-bench) | 77.2% | ~72% |
| Throughput at 24GB | Slower (dense) | Faster (MoE sparsity) |
| Vision/multimodal | ✅ | Text-only |
| Best at | Precise coding, reasoning | Chat speed, agentic |
Recommendation: For serious coding, pick 27B dense. For fast chat with multiple apps running, pick 35B-A3B MoE. See Qwen3.6-35B-A3B VRAM Requirements for the MoE sibling.
Quick start
llama.cpp (GGUF, recommended for single GPU)
# Download the Unsloth Dynamic Q4 GGUF
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf
# Run with llama.cpp server
./llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 262144 -ngl 99 --host 0.0.0.0 --port 8080
vLLM (multi-GPU or BF16)
pip install "vllm>=0.19.0" --torch-backend=auto
vllm serve Qwen/Qwen3.6-27B --port 8000 \
--tensor-parallel-size 2 --max-model-len 262144 \
--reasoning-parser qwen3
# Text-only mode (saves memory vs full multimodal)
vllm serve Qwen/Qwen3.6-27B --port 8000 --tensor-parallel-size 2 \
--max-model-len 262144 --reasoning-parser qwen3 --language-model-only
SGLang (production serving)
pip install "sglang[all]>=0.5.10"
python -m sglang.launch_server --model-path Qwen/Qwen3.6-27B \
--port 8000 --tp-size 2 --mem-fraction-static 0.8 \
--context-length 262144 --reasoning-parser qwen3
LM Studio / Jan
Search the model catalog for Qwen3.6-27B. Pick Q4_K_M (16.8 GB) or Q6_K (22.5 GB) depending on available VRAM. Enable Metal (Mac) or CUDA (NVIDIA). Avoid CUDA 13.2 — it produces gibberish outputs on Qwen 3.6 as of April 2026; NVIDIA is working on a fix. Use CUDA 13.1 or 12.x.
Recommended sampling
For thinking-mode general tasks (from the official model card):
temperature=1.0, top_p=0.95, top_k=20, min_p=0.0
For precise coding specifically, drop temperature:
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0
For non-thinking instruct mode:
temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5
Coding-specific usage
Qwen3.6-27B is the strongest open-weight coding model at its size. Real-world integrations:
Cursor / Windsurf: Point to a local vLLM or llama.cpp server (OpenAI-compatible endpoint at http://localhost:8000/v1). Model name Qwen/Qwen3.6-27B or qwen3.6-27b.
Continue.dev:
{
"models": [
{
"title": "Qwen 3.6 27B (local)",
"provider": "openai",
"model": "Qwen/Qwen3.6-27B",
"apiBase": "http://localhost:8000/v1"
}
]
}
Aider: aider --model openai/Qwen/Qwen3.6-27B --openai-api-base http://localhost:8000/v1 --openai-api-key EMPTY
Known compatibility gotchas
- Ollama: NOT supported yet — needs separate mmproj vision files. Expected within days.
- CUDA 13.2: Produces gibberish. Use CUDA 13.1 or 12.x.
- Long context OOM: At 262K+ context the KV cache dominates memory. If you hit OOM, reduce
--max-model-lenor add GPUs viatensor-parallel-size. - Thinking mode: Default on. Output can be very long — budget 32K-81K tokens for coding responses.
Related guides
- Qwen 3.6 VRAM & Hardware Requirements (35B-A3B MoE) — the sibling MoE variant + buyer guide
- Qwen3.6-35B-A3B Release Date
- Qwen 3.5 27B VRAM Requirements — the previous-gen dense 27B
- Qwen 3.6 vs Gemma 4 — head-to-head 27B dense comparison
- What Can You Run on 16GB, 24GB, 32GB VRAM?
- Best Local Coding LLMs for Apple Silicon 24GB
- VRAM Calculator
Sources
- Official model card: Qwen/Qwen3.6-27B on Hugging Face
- GGUF quants: unsloth/Qwen3.6-27B-GGUF
- Qwen Team blog: qwen.ai/blog
- Unsloth local guide: Qwen3.6 Local Inference