Report #800
[research] Which local / self-hostable coding model should I use for agentic software engineering in mid-2026?
Use Qwen3-Coder-480B-A35B-Instruct if you have 4×A100/H100-class GPUs or can run MoE-offloaded inference via vLLM / llama.cpp; otherwise Qwen3-Coder-30B-A3B-Instruct is the practical single-GPU sweet spot. On SWE-bench Verified and Aider Polyglot, these two lead the open-weights bracket and are reported to rival closed models such as GPT-4.1 and Claude Sonnet 4. Do not default to DeepSeek-R1-distill for code: SWE-MERA and SWE-bench show it performs better on older (2024) tasks and lags Qwen3-Coder on current software-engineering tasks.
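The hardware split in the recommendation above can be sanity-checked with a rough weight-memory estimate. The sketch below is only an illustration: the `Model` class, `weight_gb`, and `pick_model` are hypothetical helpers, and the 4-bit quantization and 80% headroom defaults are assumptions, not figures from this report. It counts weights only, ignoring KV cache, activations, and the CPU-offload option mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_params_b: float   # total parameters in billions (from the report)
    active_params_b: float  # parameters active per token, in billions


QWEN3_CODER_480B = Model("Qwen3-Coder-480B-A35B-Instruct", 480.0, 35.0)
QWEN3_CODER_30B = Model("Qwen3-Coder-30B-A3B-Instruct", 30.0, 3.0)


def weight_gb(model: Model, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return model.total_params_b * bits_per_param / 8.0


def pick_model(total_vram_gb: float, bits_per_param: float = 4.0,
               headroom: float = 0.8):
    """Return the largest model whose weights fit in the usable VRAM budget.

    `bits_per_param` and `headroom` are assumed defaults, not measured values.
    Returns None when neither model fits without offloading.
    """
    budget = total_vram_gb * headroom
    for model in (QWEN3_CODER_480B, QWEN3_CODER_30B):
        if weight_gb(model, bits_per_param) <= budget:
            return model.name
    return None


if __name__ == "__main__":
    for vram in (320.0, 24.0, 8.0):  # e.g. 4x80 GB, one 24 GB card, one 8 GB card
        print(vram, "GB ->", pick_model(vram))
```

Under these assumptions, four 80 GB cards leave room for the 480B weights at 4 bits, a single 24 GB card fits only the 30B variant, and an 8 GB card fits neither without offloading. Real deployments should budget extra memory for long-context KV cache.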
Journey Context:
The open-weights coding leaderboard shifted in 2025–2026 from DeepSeek-Coder-V2 and Qwen2.5-Coder to the Qwen3-Coder family. Qwen3-Coder is a Mixture-of-Experts model (480B total, ~35B active) with 256K–1M context, an Apache 2.0 license, and strong tool-calling, which makes it viable as the reasoning core of a coding agent. The smaller 30B-A3B variant gives most of the capability at ~10% of the inference cost and fits consumer/prosumer GPU setups. Mistral's Devstral-Small-2505 is a surprisingly strong performer for its size, but Qwen3-Coder remains the safest default because it has the broadest benchmark coverage and the best open-source tooling support. Reasoning models like DeepSeek-R1 exhibit a temporal bias: they do well on algorithmic problems in the style of 2024 and earlier but underperform on 2025-era real GitHub issues, so they are not the automatic choice for agentic coding.
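The "~10% of the inference cost" figure is consistent with a simple ratio of active parameters, assuming per-token compute scales roughly with the number of active parameters. That scaling is an assumption of this sketch; the report does not say how the figure was derived. The helper `ratio` is hypothetical.

```python
def ratio(numerator_b: float, denominator_b: float) -> float:
    """Plain ratio of two parameter counts given in billions."""
    return numerator_b / denominator_b


# Parameter counts as stated in the report (billions).
BIG_TOTAL, BIG_ACTIVE = 480.0, 35.0    # Qwen3-Coder-480B-A35B
SMALL_TOTAL, SMALL_ACTIVE = 30.0, 3.0  # Qwen3-Coder-30B-A3B

# Per-token compute of the small model relative to the large one,
# under the assumption that compute tracks active parameters.
relative_compute = ratio(SMALL_ACTIVE, BIG_ACTIVE)

# Fraction of each model's weights that is active for any one token.
big_active_fraction = ratio(BIG_ACTIVE, BIG_TOTAL)
small_active_fraction = ratio(SMALL_ACTIVE, SMALL_TOTAL)

if __name__ == "__main__":
    print(f"small vs large per-token compute: {relative_compute:.1%}")  # 8.6%
    print(f"480B model active fraction: {big_active_fraction:.1%}")     # 7.3%
    print(f"30B model active fraction: {small_active_fraction:.1%}")    # 10.0%
```

The ratio covers compute only. Memory footprint tracks total parameters, which is why the 480B model still needs multi-GPU or offloaded inference even though only ~35B parameters are active per token.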
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:58:35.686076+00:00— report_created — created