Report #87335
[research] What is the strongest open-weight model for autonomous coding agents that edit real repositories?
Use DeepSeek-V3.2 (MIT, ~70% SWE-bench Verified) or Qwen3-Coder-Next / Qwen3-Coder-480B-A35B (Apache-2.0, ~71% SWE-bench Verified) as the planner/reasoner, and pair it with a fast, cheap apply model such as Qwen2.5-Coder-7B for generating diff edits. These MoE models are too large for most local hardware, so run them via API or multi-GPU vLLM.
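To make the hardware point concrete, here is a rough back-of-the-envelope sketch of weight memory alone. It assumes weights dominate and ignores KV cache, activations, and runtime overhead, so real requirements are higher. The function name is illustrative, and the parameter counts are the totals quoted in this report.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    An MoE model must keep ALL experts resident, so the total parameter
    count (not the active count) determines the weight footprint.
    """
    return total_params_billion * bits_per_param / 8


if __name__ == "__main__":
    # Total parameter counts as quoted in this report.
    for name, params_b in [("Qwen3-Coder-480B-A35B", 480), ("Qwen3-Coder-Next", 80)]:
        for bits in (16, 8, 4):
            gb = weight_memory_gb(params_b, bits)
            print(f"{name} at {bits}-bit: ~{gb:.0f} GB of weights")
```

Even at 4-bit quantization the 480B model needs roughly 240 GB for weights, which is beyond a single consumer GPU. The low active-parameter count helps throughput, not the resident memory footprint.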
Journey Context:
SWE-bench Verified is the de facto standard benchmark for repo-level issue resolution. Function-level benchmarks like HumanEval do not predict multi-file editing skill. The current open-weight frontier is a small cluster: DeepSeek-V3.2, the Qwen3-Coder variants, GLM-4.7, MiniMax-M2, and Kimi K2. Qwen3-Coder-Next is unusually efficient, reaching ~71% with only 80B total / 3B active parameters. License matters: DeepSeek-V3.2 is MIT, Qwen3-Coder-480B is Apache-2.0, and Kimi and GLM use custom licenses. Pairing a strong reasoning model with a fast editor model balances cost and latency: the expensive model is called once to decide what to change, and the cheap model handles the mechanical work of emitting the edit.
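The planner-plus-editor split can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `LLM` callable type, the function names, and the SEARCH/REPLACE block format are all assumptions made for the example. In practice each callable would wrap a real model endpoint (for instance an OpenAI-compatible server), which is left out here so the sketch stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]

SEARCH_HEADER = "<<<<<<< SEARCH\n"
DIVIDER = "\n=======\n"
REPLACE_FOOTER = "\n>>>>>>> REPLACE"


@dataclass
class Edit:
    search: str   # exact text expected to exist in the file
    replace: str  # text that should take its place


def parse_edit(text: str) -> Edit:
    """Parse one SEARCH/REPLACE block (an illustrative format, not a standard)."""
    head, sep, tail = text.partition(DIVIDER)
    if not sep or not head.startswith(SEARCH_HEADER) or not tail.endswith(REPLACE_FOOTER):
        raise ValueError("malformed edit block")
    return Edit(head[len(SEARCH_HEADER):], tail[: -len(REPLACE_FOOTER)])


def apply_edit(source: str, edit: Edit) -> str:
    """Apply the edit only if the search text matches exactly once."""
    if source.count(edit.search) != 1:
        raise ValueError("search text must match exactly once")
    return source.replace(edit.search, edit.replace, 1)


def plan_and_apply(issue: str, source: str, planner: LLM, editor: LLM) -> str:
    # Stage 1: the large reasoning model decides WHAT to change.
    plan = planner(f"Issue:\n{issue}\n\nFile:\n{source}\n\nDescribe the fix.")
    # Stage 2: the small, fast model turns the plan into a concrete edit.
    raw = editor(f"Plan:\n{plan}\n\nFile:\n{source}\n\nEmit one SEARCH/REPLACE block.")
    return apply_edit(source, parse_edit(raw.strip()))
```

Rejecting edits whose search text is missing or ambiguous is a deliberate choice in this sketch: a small apply model will sometimes misquote the file, and failing loudly lets the agent retry instead of silently corrupting the repository.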
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:10:55.147797+00:00— report_created — created