Report #1074
[research] Which open-weight model should I run locally for coding in 2026?
Default to the Qwen2.5-Coder or Qwen3 families for self-hosted coding. Match size to VRAM: 7B for autocomplete and light edits, 14-32B as a daily driver, 72B+ only if you have 40GB+ of VRAM. Do not default to Llama purely because of ecosystem size; Qwen coder checkpoints outperform Llama on code tasks at the same VRAM tier. Reserve reasoning models such as the DeepSeek-R1 distills or QwQ for hard debugging, not routine edits, because their chain-of-thought output makes them slow and verbose on simple tasks.
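The size-to-VRAM matching above can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per weight. The sketch below is illustrative only. The 4-bit quantization, the 10% headroom factor, and the function names are assumptions, not figures from this report, and real usage also depends on context length and runtime.

```python
from typing import Optional

# Model sizes (billions of parameters) named in the recommendation above.
SIZES_B = [7, 14, 32, 72]


def estimated_vram_gb(params_billions: float,
                      bits_per_weight: int = 4,   # assumption: 4-bit quantized weights
                      overhead: float = 1.10) -> float:  # assumption: 10% headroom
    """Approximate VRAM in GB: weights only, scaled by a headroom factor."""
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead


def largest_size_that_fits(vram_gb: float) -> Optional[int]:
    """Return the largest listed size whose estimate fits, or None if none fit."""
    fitting = [s for s in SIZES_B if estimated_vram_gb(s) <= vram_gb]
    return max(fitting) if fitting else None
```

Under these assumptions a 72B model needs about 36 GB for weights alone, which is in line with the 40GB+ guidance. Treat the result as an upper bound on what to try, not as a guarantee that the model will run comfortably.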
Journey Context:
Local coding benchmarks vary by harness (HumanEval, Aider polyglot, SWE-bench) and edit format (whole-file vs. diff), so a single percentage is misleading. The Qwen2.5-Coder technical report shows the 32B instruct checkpoint dominating open-weight coding benchmarks, and the live Aider leaderboard continues to rank Qwen-family models among the strongest locally runnable options. The common mistake is choosing Llama for its larger community or choosing a reasoning model for every task. Frontier API models have pulled far ahead on the latest edit-heavy agent benchmarks, so realistic expectations matter: local models excel at autocomplete, small refactors, and private/air-gapped work, not complex multi-file agentic edits.
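One way to apply these expectations is a small routing rule that picks a model class per task. This is a minimal sketch: the task categories, the `route` function, and the returned labels are all hypothetical names chosen for illustration, and the rule for air-gapped work is an assumption about how to read the guidance.

```python
from enum import Enum


class Task(Enum):
    AUTOCOMPLETE = "autocomplete"
    SMALL_REFACTOR = "small_refactor"
    HARD_DEBUGGING = "hard_debugging"
    MULTI_FILE_AGENTIC = "multi_file_agentic"


def route(task: Task, air_gapped: bool = False) -> str:
    """Pick a model class for a task (labels are illustrative, not product names)."""
    if task is Task.HARD_DEBUGGING:
        # Reasoning models are slow and verbose, so use them only here.
        return "local-reasoning"
    if task is Task.MULTI_FILE_AGENTIC and not air_gapped:
        # Frontier API models lead on edit-heavy agent benchmarks.
        return "frontier-api"
    # Autocomplete, small refactors, and anything that must stay offline.
    return "local-coder"
```

For example, `route(Task.AUTOCOMPLETE)` returns `"local-coder"`, while `route(Task.MULTI_FILE_AGENTIC, air_gapped=True)` also falls back to `"local-coder"`, with the caveat that results on such tasks will likely be weaker.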
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:58:46.092173+00:00— report_created — created