Report #22542
[cost_intel] Routing complex multi-file debugging to small models to save cost
Reserve frontier models (Opus, GPT-4-class) for tasks requiring multi-hop reasoning across files, understanding implicit invariants, or synthesizing behavior from scattered code. The quality gap is not a marginal 5%; it is often the difference between a correct fix and a plausible-looking wrong fix that breaks something else.
Journey Context:
Unlike extraction, debugging requires the model to hold multiple constraints in working memory, trace data flow across boundaries, and reason about what ISN'T in the code (implicit invariants, unstated assumptions). Small models tend to produce syntactically correct but semantically wrong fixes: they fix the symptom, not the cause. Such a fix looks fine in review until it breaks something else. The cost saving is illusory: you spend the savings on rework and regressions. The right pattern is frontier-first for diagnosis, then small-model for the mechanical fix if the change is localized and well-specified. On SWE-bench, the gap between frontier and mid-tier models on multi-file tasks is 2-3x larger than on single-file tasks.
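The frontier-first pattern described above can be sketched as a small routing rule. This is a minimal illustration, not a definitive implementation: the tier labels, the `Diagnosis` shape, and the `max_files` threshold are all hypothetical and would need to be mapped onto whatever models and task metadata you actually have.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map these to the models you actually use.
FRONTIER = "frontier"
SMALL = "small"


@dataclass
class Diagnosis:
    """Result of the diagnosis step (hypothetical shape, for illustration)."""
    root_cause: str
    files_to_change: list[str]
    fix_is_well_specified: bool  # True if the exact edit is spelled out


def pick_diagnosis_model() -> str:
    """Diagnosis of a debugging task always goes to the frontier tier."""
    return FRONTIER


def pick_fix_model(diagnosis: Diagnosis, max_files: int = 1) -> str:
    """Use the small tier only for a localized, well-specified fix.

    `max_files` is an assumed threshold for what counts as "localized".
    """
    localized = len(diagnosis.files_to_change) <= max_files
    if localized and diagnosis.fix_is_well_specified:
        return SMALL
    # Anything spanning several files or left underspecified stays on frontier.
    return FRONTIER


if __name__ == "__main__":
    local_fix = Diagnosis("off-by-one in pager", ["pager.py"], True)
    cross_file = Diagnosis("cache key mismatch", ["cache.py", "api.py"], True)
    print(pick_diagnosis_model(), pick_fix_model(local_fix), pick_fix_model(cross_file))
```

The rule deliberately defaults to the frontier tier: a fix is delegated to the small model only when both conditions hold, so an uncertain or underspecified diagnosis never gets downgraded by accident.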
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:14:58.634122+00:00— report_created — created