Will It Run AI
qwen, qwen-3-6, qwen3.6-27b, 27b, dense, vram, gpu-requirements, gguf, coding

Qwen 3.6 27B VRAM & Hardware Requirements — Dense 27B GPU Guide (2026)

Qwen 3.6 27B: Q4_K_M ~16.8 GB fits RTX 4080 16GB. Flagship coding (77.2% SWE-bench) on a consumer GPU — GPU/Mac buyer guide and GGUF picks.

Qwen released Qwen3.6-27B on April 22, 2026 — a dense 27B model with vision capability that beats the previous-gen flagship Qwen3.5-397B-A17B MoE on SWE-bench Verified (77.2% vs 76.2%) while needing only a fraction of the hardware. If you have a 16-24GB GPU, this is the single most important open-weight release of 2026 Q2.

This page is the canonical reference for Qwen 3.6 27B VRAM requirements and Qwen 3.6 27B hardware requirements — exact GGUF quantization sizes, which GPU or Mac to buy at each tier, and how the dense 27B compares to the sibling 35B-A3B MoE. If you searched qwen 3.6 27b vram requirements or qwen 3.6 27b hardware requirements, you're in the right place.

Also in the Qwen 3.6 family: Qwen 3.6 35B-A3B MoE → — needs 24 GB VRAM, faster tok/s via MoE sparsity. For the original Qwen 3 and Qwen 3.5 families, see Qwen 3 / 3.5 GPU Requirements →.

Quick answers

  • Q4_K_M VRAM: ~16.8 GB — fits RTX 4080 16GB tight, RTX 4090 24GB comfortable
  • Q5_K_M VRAM: ~19.5 GB — RTX 4090, RTX 5090
  • Q6_K VRAM: ~22.5 GB — RTX 4090 24GB, RTX 5090 32GB, Mac M4 Max 36GB+
  • Q8_0 VRAM: ~28.6 GB — RTX 5090 32GB, Mac M4 Max 36GB
  • Full BF16: 55.6 GB on disk — needs H100 80GB or dual 4090
  • Context: 262K native, extensible to 1,010,000 tokens via YaRN
  • Architecture: Dense (no MoE), with Gated DeltaNet + Gated Attention hybrid
  • Release: April 22, 2026 on Hugging Face + ModelScope (Apache-2.0)

Exact GGUF quantization sizes

From the official unsloth/Qwen3.6-27B-GGUF repository:

QuantSizeRecommended for
UD-IQ2_M10.8 GBTight 12 GB GPUs (quality compromise)
Q3_K_M13.6 GB16 GB GPUs with long context
IQ4_XS15.4 GB16 GB GPUs balanced
Q4_015.8 GB16 GB GPUs baseline
Q4_K_S15.9 GB16 GB GPUs quality
IQ4_NL16.1 GB16 GB GPUs alternative
Q4_K_M16.8 GBDefault pick — best Q/size
UD-Q4_K_XL~17.0 GBUnsloth dynamic — recommended
Q4_117.3 GBOlder format, skip
Q5_K_S19.0 GB24 GB GPUs
Q5_K_M19.5 GB24 GB GPUs
Q6_K22.5 GB24 GB GPUs, precise coding
Q8_028.6 GB32 GB GPUs, near-lossless
BF1655.6 GBMulti-GPU / H100 80GB

Add 1-3 GB for KV cache at default context. At full 1M-token context via YaRN, KV cache can consume an additional 20-40 GB.

Can my GPU run Qwen3.6-27B?

GPUVRAMQwen3.6-27B fitBest quant
RTX 4060 Ti 8GB8 GB❌ Does not fit
RTX 3060 12GB12 GB⚠️ Q3_K_M tight, quality lossQ3_K_M
RTX 4060 Ti 16GB / RTX 4070 Ti 16GB16 GB✅ Q4_K_M fits ~0 headroomQ4_K_M
RTX 4080 16GB / RTX 4080 Super 16GB16 GB✅ Q4_K_M tightQ4_K_M
RTX 5070 Ti 16GB16 GB✅ Q4_K_M tightQ4_K_M
RX 7900 XTX 24GB24 GB✅ Q5_K_M comfortableQ5_K_M or Q6_K
RTX 3090 24GB24 GB✅ Q6_K comfortableQ6_K
RTX 4090 24GB24 GB✅ Q6_K idealQ6_K
RTX 5090 32GB32 GB✅ Q8_0 + long contextQ8_0
RTX 6000 Ada 48GB48 GB✅ Q8_0 or BF16 partialQ8_0 / BF16
H100 80GB80 GB✅ BF16 + 1M contextBF16
Mac M4 Pro 24GB24 GB unified✅ Q5_K_M comfortableQ5_K_M
Mac M4 Max 36GB36 GB unified✅ Q6_K or Q8_0Q6_K / Q8_0
Mac M4 Max 64GB64 GB unified✅ Q8_0 + long contextQ8_0
Mac Studio M3 Ultra 96GB96 GB unified✅ BF16BF16

Why Qwen3.6-27B is a big deal

Qwen's central claim: a 27B dense model matches or beats the previous-gen open-weight flagship, which was 397B total parameters (17B active) MoE. The dense architecture has concrete advantages for local inference:

  • No MoE routing: simpler inference stacks (works out of the box in llama.cpp, vLLM text mode, LM Studio).
  • Predictable latency: no expert-selection variance.
  • Fits on consumer GPUs: 16.8 GB at Q4 means RTX 4080 / 4090 / 3090 all serve it directly.
  • Multimodal out of the box: vision encoder included (images, OCR, hour-scale video).

Confirmed benchmarks (official model card)

Coding agents:

BenchmarkQwen3.6-27BQwen3.5-397B-A17B
SWE-bench Verified77.2%76.2%
SWE-bench Pro53.5%
SWE-bench Multilingual71.3%
Terminal-Bench 2.059.3%52.5%
SkillsBench Avg548.2%30.0%
QwenWebBench1487
NL2Repo36.2%
LiveCodeBench v683.9%

Knowledge + reasoning:

BenchmarkScore
MMLU-Pro86.2%
C-Eval91.4%
MMLU-Redux93.5%
GPQA Diamond87.8%
AIME 2026 (I & II)94.1%
HMMT Feb 202684.3%

Vision-language:

BenchmarkScore
MMMU82.9%
VideoMME (w/ sub.)87.7%
AndroidWorld70.3%
RefCOCO avg92.5%

Expected performance on common hardware

Community-reported numbers (will be updated as more benchmarks land):

HardwareQwen3.6-27B Q4_K_MNotes
RTX 4080 16GB~40 tok/sQ4 barely fits; short context
RTX 4090 24GB~55-60 tok/sQ4-Q6 comfortable; sweet spot
RTX 4090D 48GB~30 tok/s (Q6_K_XL at 262K ctx)Community report at full context
RTX 5090 32GB~75-85 tok/sQ6 ideal, Q8 tight
H100 80GB~130 tok/sBF16 serving
Mac M4 Pro 24GB~22 tok/sQ5_K_M
Mac M4 Max 36GB~28-32 tok/sQ6_K
Mac M4 Max 64GB~32-38 tok/sQ8_0 with long context

Qwen3.6-27B dense vs Qwen3.6-35B-A3B MoE

Same family, different tradeoffs:

AspectQwen3.6-27B denseQwen3.6-35B-A3B MoE
Total params27B35B
Active per token27B (dense)3B
VRAM Q4_K_M16.8 GB~21 GB
Coding (SWE-bench)77.2%~72%
Throughput at 24GBSlower (dense)Faster (MoE sparsity)
Vision/multimodalText-only
Best atPrecise coding, reasoningChat speed, agentic

Recommendation: For serious coding, pick 27B dense. For fast chat with multiple apps running, pick 35B-A3B MoE. See Qwen3.6-35B-A3B VRAM Requirements for the MoE sibling.

Quick start

llama.cpp (GGUF, recommended for single GPU)

# Download the Unsloth Dynamic Q4 GGUF
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q4_K_XL.gguf

# Run with llama.cpp server
./llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 262144 -ngl 99 --host 0.0.0.0 --port 8080

vLLM (multi-GPU or BF16)

pip install "vllm>=0.19.0" --torch-backend=auto

vllm serve Qwen/Qwen3.6-27B --port 8000 \
  --tensor-parallel-size 2 --max-model-len 262144 \
  --reasoning-parser qwen3

# Text-only mode (saves memory vs full multimodal)
vllm serve Qwen/Qwen3.6-27B --port 8000 --tensor-parallel-size 2 \
  --max-model-len 262144 --reasoning-parser qwen3 --language-model-only

SGLang (production serving)

pip install "sglang[all]>=0.5.10"

python -m sglang.launch_server --model-path Qwen/Qwen3.6-27B \
  --port 8000 --tp-size 2 --mem-fraction-static 0.8 \
  --context-length 262144 --reasoning-parser qwen3

LM Studio / Jan

Search the model catalog for Qwen3.6-27B. Pick Q4_K_M (16.8 GB) or Q6_K (22.5 GB) depending on available VRAM. Enable Metal (Mac) or CUDA (NVIDIA). Avoid CUDA 13.2 — it produces gibberish outputs on Qwen 3.6 as of April 2026; NVIDIA is working on a fix. Use CUDA 13.1 or 12.x.

Recommended sampling

For thinking-mode general tasks (from the official model card):

temperature=1.0, top_p=0.95, top_k=20, min_p=0.0

For precise coding specifically, drop temperature:

temperature=0.6, top_p=0.95, top_k=20, min_p=0.0

For non-thinking instruct mode:

temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5

Coding-specific usage

Qwen3.6-27B is the strongest open-weight coding model at its size. Real-world integrations:

Cursor / Windsurf: Point to a local vLLM or llama.cpp server (OpenAI-compatible endpoint at http://localhost:8000/v1). Model name Qwen/Qwen3.6-27B or qwen3.6-27b.

Continue.dev:

{
  "models": [
    {
      "title": "Qwen 3.6 27B (local)",
      "provider": "openai",
      "model": "Qwen/Qwen3.6-27B",
      "apiBase": "http://localhost:8000/v1"
    }
  ]
}

Aider: aider --model openai/Qwen/Qwen3.6-27B --openai-api-base http://localhost:8000/v1 --openai-api-key EMPTY

Known compatibility gotchas

  • Ollama: NOT supported yet — needs separate mmproj vision files. Expected within days.
  • CUDA 13.2: Produces gibberish. Use CUDA 13.1 or 12.x.
  • Long context OOM: At 262K+ context the KV cache dominates memory. If you hit OOM, reduce --max-model-len or add GPUs via tensor-parallel-size.
  • Thinking mode: Default on. Output can be very long — budget 32K-81K tokens for coding responses.

Related guides

Sources

Frequently Asked Questions

How much VRAM does qwen3.6-27b need?

Qwen3.6-27B needs ~16.8 GB at Q4_K_M (Unsloth GGUF), ~19.5 GB at Q5_K_M, ~22.5 GB at Q6_K, and ~28.6 GB at Q8_0. Full BF16 weights are 55.6 GB on disk. The dense architecture removes MoE routing complexity so every parameter is loaded and active per token.

What GPU can run Qwen3.6-27B?

Q4_K_M (16.8 GB) fits on RTX 4080 16GB tight, RTX 4080 Super 16GB comfortable, RTX 4090 24GB, RTX 3090 24GB, RTX 5090 32GB, or Mac M4 Pro 24GB+. Q6_K needs 24GB VRAM minimum. Q8_0 wants 32GB or Mac M4 Max 36GB+.

What are the hardware requirements for Qwen 3.6 27B?

Minimum: a 16 GB GPU (RTX 4080 16GB, RTX 4070 Ti 16GB, RTX 5070 Ti 16GB) or 16 GB unified memory runs Q4_K_M at 16.8 GB. Recommended: a 24 GB GPU (RTX 3090, RTX 4090) or Mac M4 Pro 24GB for Q6_K headroom and longer context. For Q8_0 or full multimodal video, target 32 GB (RTX 5090) or a Mac M4 Max 36GB+.

Is Qwen3.6-27B really better than Qwen 3.5 397B-A17B for coding?

Yes, on the published benchmarks: SWE-bench Verified 77.2% vs 76.2%, Terminal-Bench 59.3% vs 52.5%, SkillsBench 48.2% vs 30.0%. A 27B dense model beating a 397B MoE is the biggest deal of Qwen 3.6 — it means flagship coding in a consumer-GPU footprint.

Is Qwen3.6-27B multimodal?

Yes. The 27B ships with a vision encoder and supports images, documents with OCR, and hour-scale video understanding (up to 224K video tokens). For pure text serving, use the --language-model-only flag in vLLM to save memory.

Does Qwen3.6-27B work in Ollama yet?

Not as of April 23, 2026. Ollama does not yet support the separate mmproj vision files Qwen 3.6 uses. Use llama.cpp directly, LM Studio (Q4_K_M or Q6_K GGUF), vLLM ≥0.19.0, or SGLang ≥0.5.10. Ollama support is expected shortly.

Qwen3.6-27B dense vs Qwen3.6-35B-A3B MoE — which to run?

27B dense wins on coding benchmarks (SWE-bench 77.2% vs 35B-A3B's ~72%) and fits in less VRAM (16.8 GB vs 21 GB Q4). 35B-A3B wins on raw throughput (MoE sparsity = faster tok/s despite loading more weights). For serious coding, prefer 27B. For chat speed on 24GB+, prefer 35B-A3B.

What is the best quantization for Qwen3.6-27B?

Q4_K_M is the mainstream pick at 16.8 GB — near-lossless for chat and most coding. For precise coding and math where syntax/step precision matters, step up to Q6_K (22.5 GB) if you have 24GB+. Unsloth recommends their UD-Q4_K_XL dynamic quant for best quality at ~17 GB.