Report #100401
[cost_intel] Which coding tasks genuinely require a frontier model, and what is the cost of using a smaller one?
Reserve frontier models (Claude Opus/Fable, GPT-5.4/5.5, Gemini 3.1 Pro) for agentic bug-fixing on real codebases, multi-file refactors, security audits, and long-horizon planning. On SWE-bench Verified/Pro and Terminal-Bench, frontier models lead mid-tier models by 5-15 percentage points on the hardest tasks, and the gap widens on multi-language, cross-module work. For routine edits, autocompletion, and boilerplate, mid-tier or smaller coder models are sufficient.
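The recommendation above can be read as a simple routing rule. The sketch below is illustrative only: the task labels, the `Tier` enum, and `pick_tier` are hypothetical names, and sending unrecognized task types to the frontier tier is an assumption, not something the report states.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    MID = "mid-tier"


# Hypothetical labels for the task categories named in the recommendation.
FRONTIER_TASKS = {
    "agentic_bug_fix",
    "multi_file_refactor",
    "security_audit",
    "long_horizon_planning",
}
ROUTINE_TASKS = {"routine_edit", "autocomplete", "boilerplate"}


def pick_tier(task_type: str) -> Tier:
    """Map a task label to a model tier."""
    if task_type in FRONTIER_TASKS:
        return Tier.FRONTIER
    if task_type in ROUTINE_TASKS:
        return Tier.MID
    # Assumption: when the task type is unknown, prefer the safer (frontier) tier.
    return Tier.FRONTIER
```

A caller would pass a label such as `"security_audit"` and get back `Tier.FRONTIER`. How tasks get labeled in the first place is left open here.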
Journey Context:
The generic 'use smaller models' advice fails on tasks where the model must reason about code semantics across many files, decide what to edit, and verify the change with tests. Benchmarks show mid-tier models can match frontier models on simple function-level generation but fall behind on SWE-bench Pro and Terminal-Bench. The cost of using too small a model is not just lower accuracy: it also shows up as more agent turns, retries, and hallucinated tool calls, which can push total cost above that of running the frontier model once.
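The cost argument can be made concrete with a rough expected-cost calculation. This is a minimal sketch, assuming each attempt succeeds independently with a fixed probability and failed attempts are simply retried, so the expected number of attempts is 1 / success rate. The function name and all numbers are invented for illustration; they are not measurements from any benchmark or price list.

```python
def expected_cost_per_solved_task(
    price_per_turn: float,
    turns_per_attempt: float,
    success_rate: float,
) -> float:
    """Expected spend to get one solved task, retrying until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_turn * turns_per_attempt
    # With independent retries, expected attempts until success = 1 / success_rate.
    return cost_per_attempt / success_rate


# Illustrative, made-up inputs (not real prices or benchmark results).
frontier_cost = expected_cost_per_solved_task(
    price_per_turn=0.50, turns_per_attempt=10, success_rate=0.6
)
small_cost = expected_cost_per_solved_task(
    price_per_turn=0.10, turns_per_attempt=25, success_rate=0.2
)
print(f"frontier: {frontier_cost:.2f}  small: {small_cost:.2f}")
```

With these invented inputs the smaller model costs about 12.50 per solved task against about 8.33 for the frontier model, even though its per-turn price is five times lower. The comparison flips as soon as the smaller model's success rate or turn count approaches the frontier model's, which is why the routine-task case favors the cheaper tier.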
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:10:07.087405+00:00— report_created — created