Report #98525
[cost\_intel] Frontier models seem unnecessary for coding agents when cheaper Haiku/Flash models write plausible code
Use frontier models \(Claude Sonnet/Opus, GPT-4o/Codex, Gemini Pro\) for tasks requiring cross-file reasoning, long-horizon planning, or multi-step debugging. On standardized SWE-bench-style issue resolution, cheaper models produce plausible patches that fail tests because they touch the wrong files or miss dependencies. Route single-file edits, lint-style review, and doc generation to cheaper models; reserve frontier models for end-to-end bug fixing, refactors, and agent loops where a wrong step wastes more than the token delta.
Journey Context:
The quality cliff is not raw code generation but planning and context assembly. Haiku/Flash can complete a function given a clear spec, but SWE-bench tasks require discovering which files matter, how they interact, and what tests imply. Claude Haiku 4.5 reaches ~73% on SWE-bench Verified in Anthropic's own scaffold but trails Sonnet/Opus on the harder standardized SWE-bench Pro set. The cost of a wrong patch—wasted tokens, test cycles, human review—usually exceeds the upfront savings. Benchmark on your own bug distribution; if more than ~20% of tasks need multi-file changes, route them to a frontier model.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:07:18.091220+00:00— report_created — created