Report #29596
[cost\_intel] Using small models for multi-file debugging that requires understanding cross-module causal chains
Reserve frontier models \(Opus, o1, GPT-4\) for debugging tasks involving 3\+ files, async interactions, or implicit contracts between modules. The cost of a wrong fix \(developer rework, re-deployment, regressions\) dwarfs the model cost difference.
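The routing rule above can be sketched as a small triage function. This is a minimal illustration, not a tested classifier: `BugReport`, `choose_model_tier`, the keyword pattern, and the default threshold of three files are all hypothetical names and values chosen for the example. Implicit contracts between modules are hard to detect from keywords alone, so the sketch only covers the file-count and concurrency signals.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword list for timing/ordering/concurrency bugs; tune per codebase.
CONCURRENCY_HINTS = re.compile(
    r"\b(race|deadlock|async\w*|await|timing|ordering|concurren\w*|thread\w*|lock\w*)\b",
    re.IGNORECASE,
)


@dataclass
class BugReport:
    description: str
    files: list[str]  # files named in the report or stack trace


def choose_model_tier(bug: BugReport, file_threshold: int = 3) -> str:
    """Return "frontier" or "small" using the rule of thumb from this report."""
    touches_many_files = len(set(bug.files)) >= file_threshold
    mentions_concurrency = bool(CONCURRENCY_HINTS.search(bug.description))
    return "frontier" if touches_many_files or mentions_concurrency else "small"


# Example usage with made-up bug reports.
single_file = BugReport("Typo in error message", ["utils.py"])
race = BugReport("Intermittent race condition on cache refresh", ["cache.py"])
print(choose_model_tier(single_file))  # small
print(choose_model_tier(race))         # frontier
```

A keyword match is deliberately biased toward the frontier tier: a false positive costs a few cents, while a false negative risks a wrong fix.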
Journey Context:
Small models handle single-file bugs and obvious errors well. For bugs involving race conditions, cross-service contract violations, subtle type mismatches at module boundaries, or cascading failures, frontier models have a genuine and measurable advantage. On SWE-bench, the gap between frontier and small models widens on multi-file issues: frontier models resolve ~2x more multi-file bugs. The economic argument looks counterintuitive only when model prices are compared in isolation: a debugging query might cost $0.10 \(frontier\) vs $0.01 \(small\), a 10x difference, but a wrong fix costs $50-500 in developer time. Expected total cost is the model cost plus the probability of a wrong fix times the rework cost, so the extra $0.09 pays for itself once the frontier model lowers the wrong-fix rate by more than about 0.2 percentage points \(0.09 / 50, at the low end of the rework range\). If the small model's error rate on complex bugs is 2x higher, the total cost \(model \+ rework\) clearly favors the frontier model. The practical rule: if the bug description references 3\+ files or involves timing, ordering, or concurrency, use the frontier model without hesitation.
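The cost comparison above can be worked through in a few lines. The per-query prices and the $50 rework figure come from this report; the wrong-fix rates of 40% and 20% are assumptions picked only to illustrate a 2x gap, and the function names are hypothetical.

```python
def expected_cost(model_cost: float, wrong_fix_rate: float, rework_cost: float) -> float:
    """Expected total cost of one debugging query: model price plus expected rework."""
    return model_cost + wrong_fix_rate * rework_cost


def break_even_rate_gap(frontier_cost: float, small_cost: float, rework_cost: float) -> float:
    """Smallest reduction in wrong-fix rate that justifies the pricier model."""
    return (frontier_cost - small_cost) / rework_cost


FRONTIER_PRICE = 0.10  # USD per query, from the report
SMALL_PRICE = 0.01     # USD per query, from the report
REWORK = 50.0          # USD, low end of the report's $50-500 range

# Assumed wrong-fix rates on complex bugs (illustrative only, 2x apart).
small_total = expected_cost(SMALL_PRICE, 0.40, REWORK)
frontier_total = expected_cost(FRONTIER_PRICE, 0.20, REWORK)

print(f"small:    ${small_total:.2f}")     # small:    $20.01
print(f"frontier: ${frontier_total:.2f}")  # frontier: $10.10
print(f"break-even gap: {break_even_rate_gap(FRONTIER_PRICE, SMALL_PRICE, REWORK):.4f}")
```

Under these assumptions the model price is a rounding error next to the rework term, which is why the break-even gap in wrong-fix rate is so small.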
Lifecycle
2026-06-18T04:04:01.077369+00:00— report_created — created