Qwen 3.5 on RTX 4090 — VRAM, Tokens/s, Best Runtime, and What Actually Fits
How well does Qwen 3.5 run on an RTX 4090 24GB? Practical guidance for Qwen 3.5 9B, 27B, and 35B A3B with VRAM requirements, tok/s estimates, and runtime tradeoffs.
If you own an RTX 4090 24GB, Qwen 3.5 is one of the model families you should care about most.
The good news: this is a very strong card for Qwen.
The bad news: "Qwen 3.5 on 4090" does not have a single answer. It depends on:
- which Qwen 3.5 size you mean
- which quantization you use
- which runtime you choose
- whether your workload is short-chat, coding, or long-context
That is exactly where most shallow benchmark threads stop being useful.
The Fast Answer
On a single 24GB RTX 4090:
- Qwen 3.5 9B is easy and fast
- Qwen 3.5 27B is the interesting edge of practicality
- Qwen 3.5 35B A3B is possible, but only if you accept more compromises than the raw model name suggests
Our Practical Baseline
For this page, the most useful baseline is:
- runtime: llama.cpp
- quant: the most practical quant per model
- hardware: single RTX 4090 24GB
- focus: local single-user use, not a large serving cluster
That matches how many serious local users actually approach a 4090.
The Numbers That Matter
Best practical single-GPU view
| Model | Quant | Memory Needed | Fit on 4090 | Decode Speed | Practical read |
|---|---|---|---|---|---|
| Qwen 3.5 9B | Q5_K_M | 12.0 GB | Native fit | 126 tok/s | Excellent daily-driver tier |
| Qwen 3.5 27B | Q4_K_M | 22.9 GB | Hybrid on coding-style workload | 52 tok/s | Very strong, but memory is tight |
| Qwen 3.5 27B | Q3_K_M | 19.7 GB | Tight fit | 60.3 tok/s | Often the cleanest practical 27B answer |
| Qwen 3.5 35B A3B | Q4_K_M | 26.1 GB | Unsafe fit | 42 tok/s | Too tight for a clean single-4090 setup |
| Qwen 3.5 35B A3B | Q3_K_M | 21.9 GB | Tight fit | 42 tok/s | The more realistic way to run it |
The key point is this:
The RTX 4090 is amazing for Qwen 9B, good-to-very-good for Qwen 27B, and compromise territory for Qwen 35B A3B unless you quantize harder.
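To make the fit column concrete, here is a minimal sketch that subtracts each "Memory Needed" figure in the table from the card's nominal 24 GB. The names `TABLE`, `VRAM_GB`, and `headroom_gb` are illustrative, and treating the full 24 GB as usable is an optimistic assumption, since the desktop and the runtime itself also take some VRAM.

```python
# Memory figures are copied from the table above.
# VRAM_GB is the card's nominal capacity; real usable VRAM is somewhat lower (assumption).
VRAM_GB = 24.0

TABLE = [
    ("Qwen 3.5 9B", "Q5_K_M", 12.0),
    ("Qwen 3.5 27B", "Q4_K_M", 22.9),
    ("Qwen 3.5 27B", "Q3_K_M", 19.7),
    ("Qwen 3.5 35B A3B", "Q4_K_M", 26.1),
    ("Qwen 3.5 35B A3B", "Q3_K_M", 21.9),
]


def headroom_gb(needed_gb, vram_gb=VRAM_GB):
    """Return spare VRAM in GB; a negative value means the setup exceeds the card."""
    return round(vram_gb - needed_gb, 1)


if __name__ == "__main__":
    for model, quant, needed in TABLE:
        spare = headroom_gb(needed)
        status = "fits" if spare >= 0 else "over budget"
        print(f"{model:18s} {quant:7s} needs {needed:4.1f} GB, "
              f"headroom {spare:+5.1f} GB ({status})")
```

Read this way, 27B at Q4_K_M leaves only about 1 GB spare, and 35B A3B at Q4_K_M is about 2 GB over the card's capacity.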
Qwen 3.5 9B on RTX 4090
This is the easy win.
Qwen 3.5 9B on 4090 is in the "stop overthinking it and just run it" category.
Why it works so well:
- enough VRAM to run high-quality quantization comfortably
- enough bandwidth to keep decode speed high
- enough headroom for long context and background overhead
If you want one Qwen model that feels obviously great on a single 4090, this is the answer for most people.
Qwen 3.5 27B on RTX 4090
This is the real question.
People ask about 27B because it is the point where 24GB starts to matter. And the answer is:
- for chat-style use: Q4_K_M is realistic
- for coding-style use with larger context: Q4_K_M becomes tight enough that runtime overhead really matters
- for a cleaner daily setup: Q3_K_M is often the more comfortable compromise
That is why you see mixed takes online.
They are often based on slightly different assumptions about quant, context length, and runtime.
The runtime difference matters
For Qwen 3.5 27B on a 4090, the runtime choice is not cosmetic:
| Runtime | Chat-style memory need | Coding-style memory need | Why it matters |
|---|---|---|---|
| llama.cpp | 21.4 GB | 22.9 GB | Lowest overhead, best if you want to squeeze the card cleanly |
| Ollama | 21.7 GB | 23.2 GB | Easier to use, slightly tighter memory story |
| vLLM | 22.9 GB | 24.4 GB | Better for serving, but the extra overhead makes 24GB feel small quickly |
This is exactly why "just use the best runtime on paper" is the wrong answer. On a single 24GB card, lower-overhead local runtimes often make the setup materially better.
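The arithmetic behind the runtime table can be sketched as follows. The figures are copied from the table; `RUNTIME_MEMORY`, `overhead_vs_baseline`, and `fits` are hypothetical helper names, and the check treats the nominal 24 GB as fully usable, which is an optimistic assumption.

```python
VRAM_GB = 24.0  # nominal capacity; usable VRAM is somewhat lower in practice (assumption)

# (chat-style GB, coding-style GB) for Qwen 3.5 27B, copied from the runtime table above.
RUNTIME_MEMORY = {
    "llama.cpp": (21.4, 22.9),
    "Ollama": (21.7, 23.2),
    "vLLM": (22.9, 24.4),
}

WORKLOAD_INDEX = {"chat": 0, "coding": 1}


def overhead_vs_baseline(runtime, baseline="llama.cpp"):
    """Extra chat-style memory in GB compared with the baseline runtime."""
    return round(RUNTIME_MEMORY[runtime][0] - RUNTIME_MEMORY[baseline][0], 1)


def fits(runtime, workload, vram_gb=VRAM_GB):
    """True if the table's figure for this runtime and workload is within vram_gb."""
    return RUNTIME_MEMORY[runtime][WORKLOAD_INDEX[workload]] <= vram_gb


if __name__ == "__main__":
    for name in RUNTIME_MEMORY:
        print(f"{name:10s} overhead vs llama.cpp: {overhead_vs_baseline(name):+.1f} GB, "
              f"chat fits: {fits(name, 'chat')}, coding fits: {fits(name, 'coding')}")
```

On these numbers, vLLM needs 1.5 GB more than llama.cpp, which is enough to push the coding-style case past 24 GB while the other two runtimes stay under it.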
Qwen 3.5 35B A3B on RTX 4090
This is where people see "A3B" and assume the 4090 will handle it effortlessly.
Not quite.
The active parameters per token are small, which helps throughput. But the full model still has to fit well enough in memory. That is why the speed can look okay while the fit story still feels awkward.
The right conclusion is:
- Q4_K_M is too tight to recommend cleanly as the default answer
- Q3_K_M is much more realistic
- if you truly want this class of model without compromise, you are now in 48GB+, multi-GPU, or bigger-memory-platform territory
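The gap between "small active parameters" and "full model must fit" can be illustrated with a rough sketch. It assumes that "A3B" means roughly 3B active parameters per token and uses 4.5 bits per weight as a round illustrative figure, not a measured value for any specific quant. The helper `weight_memory_gb` is hypothetical and covers weights only, so its output will not match the table's "Memory Needed" figures, which are higher and presumably also cover context and runtime overhead.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring cache and overhead."""
    return params_billions * bits_per_weight / 8


TOTAL_B = 35.0   # total parameters in billions, taken from the model name
ACTIVE_B = 3.0   # active parameters per token in billions (assumed from "A3B")
BITS = 4.5       # illustrative bits per weight; an assumption, not a measurement

resident = weight_memory_gb(TOTAL_B, BITS)   # must sit in memory regardless of routing
touched = weight_memory_gb(ACTIVE_B, BITS)   # roughly what one token's forward pass reads

if __name__ == "__main__":
    print(f"weights resident in memory: ~{resident:.1f} GB")
    print(f"weights touched per token:  ~{touched:.1f} GB")
```

Under these assumptions, decode speed tracks the small per-token figure while VRAM fit tracks the large resident one, which is why the speed can look fine even when the fit does not.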
Best Runtime by Goal
Best for personal local use
llama.cpp
Why:
- lower memory overhead
- practical on a single 24GB card
- strong fit for GGUF-style local workflows
Best for convenience
Ollama
Why:
- easiest setup
- still very viable for Qwen 9B
- still good for 27B if you accept the tighter memory budget
Best for API serving
vLLM
Why:
- better batching and serving behavior
- better answer for throughput-focused use
- but on a single 4090 it is not the best way to squeeze borderline models into 24GB
So What Should a 4090 Owner Actually Run?
Use this rule:
- If you want the easiest great experience: Qwen 3.5 9B
- If you want the strongest single-4090 dense Qwen tier: Qwen 3.5 27B
- If you want to experiment with the MoE option: Qwen 3.5 35B A3B, but accept that quantization becomes the real decision
That is the honest hierarchy.
When the 4090 Stops Being Enough
You are leaving single-4090 territory when:
- you want 27B with more context and less compromise
- you want 35B-A3B without hard quantization tradeoffs
- you want higher-overhead serving runtimes and still want lots of memory headroom
- you want the next step toward 70B and above
At that point the path becomes:
- a bigger-memory card
- a multi-GPU plan
- or a platform with more shared/unified memory
If that is where you are headed, read Multi-GPU LLM Inference, Ollama Multi-GPU, and How to Build a Local AI Workstation in 2026.