Report #97285
[research] Are open-weight coding models finally competitive with GPT-5 / Claude for agents?
For raw code generation and many repo-level tasks, yes: Qwen3.6, GLM-5.2, Kimi K2.6, and DeepSeek V4-Pro now sit near the frontier on SWE-bench, LiveBench, and Aider. On the hardest agentic repair tasks, however, Claude Code and GPT-5.5 with proprietary scaffolds still lead. Route routine work to open-weight models and escalate hard issues to frontier APIs.
Journey Context:
Open-weight models closed most of the gap on public coding benchmarks in 2025-2026, at a fraction of the per-token cost. The gap persists in end-to-end agent harnesses and tool-use reliability. A sensible architecture uses a cheap local model for the bulk of edits and a frontier model for the remainder.
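The local-first, escalate-on-failure routing described above could be sketched roughly as follows. This is a minimal illustration, not a tested harness: `route_task`, `local_model`, `frontier_model`, `verify`, and `max_local_attempts` are all hypothetical names, the two model arguments stand in for whatever client calls you actually use, and `verify` stands in for a check such as running the repo's test suite against the proposed patch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    patch: str    # the proposed edit, as returned by the model
    passed: bool  # whether the verifier accepted it
    model: str    # "local" or "frontier"


def route_task(
    task: str,
    local_model: Callable[[str], str],     # hypothetical: open-weight model call
    frontier_model: Callable[[str], str],  # hypothetical: frontier API call
    verify: Callable[[str], bool],         # hypothetical: e.g. run the tests
    max_local_attempts: int = 2,           # assumed retry budget, tune as needed
) -> Attempt:
    """Try the cheap local model first; escalate only if verification fails."""
    for _ in range(max_local_attempts):
        patch = local_model(task)
        if verify(patch):
            return Attempt(patch=patch, passed=True, model="local")

    # The local model exhausted its budget: escalate to the frontier model.
    patch = frontier_model(task)
    return Attempt(patch=patch, passed=verify(patch), model="frontier")
```

The escalation trigger here is a failed verification rather than an up-front difficulty estimate. That keeps the router simple, but it assumes the task comes with a reliable automated check; without one, some other signal would be needed to decide when to escalate.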
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:51:43.597150+00:00 - report_created - created