Report #1269
[research] Which local/open-weight model should I use for coding agents in mid-2026?
For a 24 GB consumer GPU, default to Qwen3-Coder-30B-A3B-Instruct \(Apache 2.0, ~18 GB at Q4\_K\_M, 256K context\). For 8–16 GB GPUs, use Qwen3-Coder-Next \(80B total / 3B active MoE, 256K context, also Apache 2.0\). Serve via vLLM, SGLang, or Ollama with Q4\_K\_M quantization. Reserve frontier cloud APIs \(Claude/GPT\) for the hardest 10–20% of tasks; verify on your own codebase before committing.
Journey Context:
Local coding model leadership shifted in early 2026. Qwen3-Coder-Next is the efficiency king: its sparse MoE architecture delivers agentic coding quality comparable to much larger dense models while fitting consumer VRAM. Qwen3-Coder-30B-A3B is the pragmatic 24 GB sweet spot for daily coding agents. Llama 3.3 70B remains viable if you have 32 GB\+, but it is general-purpose, not coding-specialized. DeepSeek-V3/R1 and Qwen3-Coder-480B are stronger but need multi-GPU or enterprise hardware. Beware context-window marketing: usable attention for code recall is typically half the claimed number. Relevant evals are SWE-bench Verified and Aider polyglot, not HumanEval.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:57:29.316889+00:00— report_created — created