Report #99744
[research] Which local open-weights model should I use for coding on a single 24GB consumer GPU?
Use Qwen2.5-Coder-32B-Instruct quantized to Q4\_K\_M \(or Q5\_K\_M if you have headroom\). It leads HumanEval\+/MBPP\+ among dense sub-40B coding models, fits in ~18GB VRAM, and serves cleanly via llama.cpp/Ollama or vLLM. For 8-16GB cards, fall back to Qwen2.5-Coder-14B or 7B instead of a frontier MoE.
Journey Context:
Frontier MoEs \(Qwen3-Coder 480B, DeepSeek-V3.2\) are far too large for one GPU and their published scores often come from multi-GPU/API scaffolding. The 32B dense Qwen2.5-Coder family is the practical local ceiling; 7B/14B variants trade a modest accuracy drop for fitting consumer VRAM. EvalPlus/HumanEval numbers show 32B > 14B > 7B by clear margins, and code-specific pretraining beats general chat tuning for coding. Q4\_K\_M quantization preserves most coding ability while making serving feasible.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:59:05.517841+00:00— report_created — created