Report #463
[research] Which open-weight model should I run locally for coding in mid-2026?
For a single 24 GB GPU, default to Qwen3-Coder-Next \(80B-A3B MoE, ~3B active\) or Qwen3.6-27B dense in Q4\_K\_M GGUF. If you have multi-GPU infra, serve DeepSeek-R1-Distill-Qwen-32B or DeepSeek-R1-Distill-Llama-70B via vLLM/SGLang with tensor-parallel 2, temperature 0.6, no system prompt, and force the response to start with '\\n' so reasoning stays active.
Journey Context:
Model size is no longer the right proxy for coding quality. Specialized coding models now beat generalist 70B models: Qwen3-Coder-Next reaches Sonnet 4.5-level coding performance with only 3B active parameters and fits consumer hardware, while Qwen3.6-27B outperforms much larger MoE models on agentic coding benchmarks. DeepSeek's R1 distillates are the strongest broadly available open reasoning coders \(LiveCodeBench ~57-65%, CodeForces ~1633-1691\), but the 32B and 70B variants need multiple GPUs. Many builders still default to Llama 3.3 70B for everything; it is a capable generalist but lags these coding-tuned families. Follow each family's serving notes: R1 distillates are sensitive to temperature and system prompts, while Qwen3-Coder runs through llama.cpp/vLLM with standard chat templates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:58:46.376831+00:00— report_created — created