Report #99752
[research] Which open-weight model should I use for agentic coding if I can self-host on multiple GPUs?
For repo-level agentic coding, serve Qwen3-Coder-480B-A35B, DeepSeek-V3.2, or Kimi K2 with vLLM or SGLang behind a standardized agent scaffold (e.g., SWE-agent or mini-SWE-agent). Do not compare vendors' self-reported SWE-bench numbers with each other; run every candidate through a single harness and compare cost per resolved issue. If your workload allows retries, reasoning/thinking variants can lift SWE-bench scores by several points.
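The "cost per resolved issue" comparison can be sketched as below. This is a minimal illustration, not part of any harness: `RunResult`, `cost_per_resolved`, `rank_models`, and the GPU-hour price are hypothetical names and inputs, and it assumes you have already run each model through the same scaffold and recorded how many issues it resolved and how many GPU-hours the run consumed.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One model's run over a fixed issue set in a single harness (hypothetical record)."""
    model: str
    attempted: int      # issues the agent was run on
    resolved: int       # issues whose tests passed afterwards
    gpu_hours: float    # total GPU-hours consumed, retries included


def cost_per_resolved(run: RunResult, gpu_hour_price: float) -> float:
    """Total serving cost divided by resolved issues; infinite if nothing was resolved."""
    if run.resolved == 0:
        return float("inf")
    return run.gpu_hours * gpu_hour_price / run.resolved


def rank_models(runs: list[RunResult], gpu_hour_price: float) -> list[RunResult]:
    """Sort runs from cheapest to most expensive per resolved issue."""
    return sorted(runs, key=lambda r: cost_per_resolved(r, gpu_hour_price))


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are not measurements.
    runs = [
        RunResult("model-a", attempted=100, resolved=50, gpu_hours=200.0),
        RunResult("model-b", attempted=100, resolved=40, gpu_hours=100.0),
    ]
    for run in rank_models(runs, gpu_hour_price=2.0):
        print(run.model, cost_per_resolved(run, gpu_hour_price=2.0))
```

Because retries add GPU-hours as well as resolved issues, counting them in `gpu_hours` keeps a multi-attempt configuration comparable with a single-attempt one.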
Journey Context:
The open-weights frontier has caught up to proprietary models on SWE-bench Verified/Pro. Qwen3-Coder-480B leads single-attempt open scores (Apache 2.0); Kimi K2 peaks under multi-attempt evaluation; DeepSeek-V3.2 offers a strong MIT-licensed balance. These MoE models are huge (480B to 1T total parameters) and require multi-GPU serving, but their per-token API prices are far lower than those of frontier closed models. The key is scaffolding: model choice matters less than a good search/edit/test loop and a clean tool spec. A standardized harness makes this visible because every model runs through the same loop.
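To see why models of this size need multi-GPU serving, a back-of-the-envelope estimate of weight memory helps. The sketch below is an approximation under stated assumptions: it counts weights only (no KV cache, activations, or framework overhead), and the 80 GB GPU size and 0.9 usable fraction are illustrative choices, not recommendations. The function names are hypothetical.

```python
import math


def weight_memory_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return total_params_billion * bytes_per_param


def min_gpus_for_weights(
    total_params_billion: float,
    bytes_per_param: float,
    gpu_memory_gb: float,
    usable_fraction: float = 0.9,  # assumption: share of GPU memory available for weights
) -> int:
    """Lower bound on GPU count needed just to hold the weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(total_params_billion, bytes_per_param) / usable_gb)


if __name__ == "__main__":
    # 480B total parameters at 1 byte/param (8-bit) and 2 bytes/param (16-bit),
    # on hypothetical 80 GB GPUs.
    for bytes_per_param in (1, 2):
        print(bytes_per_param, min_gpus_for_weights(480, bytes_per_param, gpu_memory_gb=80))
```

This is only a floor. All experts of an MoE must be resident even though few are active per token, and real deployments need extra headroom for the KV cache; serving frameworks also tend to want a parallelism degree that divides the model evenly, so the practical count is usually higher than this estimate.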
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:02.384576+00:00 - report_created - created