Report #38656
[cost_intel] Using budget models for code review and bug detection where missed defects have asymmetrically high production cost
Use frontier models for code review and security-sensitive bug detection. The $0.01-0.03 per-review cost difference is negligible compared to the cost of a missed bug reaching production. Reserve budget models for style and formatting checks with deterministic rules.
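The routing rule above can be sketched as a small dispatch function. This is a minimal illustration, not a reference implementation: the `Tier` enum, the task labels in `DETERMINISTIC_TASKS`, and the `pick_tier` function are all hypothetical names introduced here, and a real pipeline would map them onto whatever models and task categories it actually uses.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    BUDGET = "budget"


# Hypothetical task labels for rule-based checks (assumption, not from any real tool).
DETERMINISTIC_TASKS = {"style", "formatting", "pattern_detection"}


def pick_tier(task: str, security_sensitive: bool = False) -> Tier:
    """Choose a model tier for one review task.

    Anything security-sensitive, and anything that is not a known
    deterministic check, goes to the frontier tier by default.
    """
    if security_sensitive:
        return Tier.FRONTIER
    if task in DETERMINISTIC_TASKS:
        return Tier.BUDGET
    return Tier.FRONTIER


if __name__ == "__main__":
    for task in ("style", "bug_detection", "formatting"):
        print(task, "->", pick_tier(task).value)
```

Defaulting unknown tasks to the frontier tier reflects the asymmetry in the recommendation: sending a cheap task to the expensive model wastes a few cents, while sending a reasoning task to the budget model risks a missed defect.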
Journey Context:
Code review is an asymmetric-risk task: false negatives (missing a real bug) cost orders of magnitude more than false positives (flagging clean code). On SWE-bench and similar benchmarks, which measure issue resolution and serve here as a rough proxy for code reasoning ability, frontier models resolve roughly 40-50% of real issues while budget models resolve roughly 15-25%. For subtle bugs (race conditions, off-by-one errors, authorization bypasses), the gap widens further. Cost comparison: a frontier model review costs roughly $0.03 per 500-line diff, a budget model review roughly $0.003. For a team reviewing 200 diffs/day, that is $6/day vs $0.60/day, a $5.40/day difference that buys significantly better bug detection. Budget models do work for a narrow set of code tasks: deterministic style checks (linting-like rules), simple pattern detection (TODO comments, console.log statements), and formatting enforcement. These are classification tasks, not reasoning tasks, and budget models handle them well.
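The cost arithmetic above can be reproduced in a few lines. The per-diff prices and the 200 diffs/day volume come from the report; the `break_even_bugs_per_day` helper and the $5,000 cost of a missed bug used in the example are illustrative assumptions, not figures from the report.

```python
# Figures taken from the report (rough estimates, per 500-line diff).
FRONTIER_COST_PER_DIFF = 0.03
BUDGET_COST_PER_DIFF = 0.003
DIFFS_PER_DAY = 200


def daily_cost(cost_per_diff: float, diffs_per_day: int) -> float:
    """Total daily review spend for one model tier."""
    return cost_per_diff * diffs_per_day


def break_even_bugs_per_day(extra_daily: float, cost_per_missed_bug: float) -> float:
    """Extra bugs per day the pricier tier must catch to pay for itself.

    cost_per_missed_bug is a hypothetical input; estimate it for your own team.
    """
    return extra_daily / cost_per_missed_bug


frontier_daily = daily_cost(FRONTIER_COST_PER_DIFF, DIFFS_PER_DAY)
budget_daily = daily_cost(BUDGET_COST_PER_DIFF, DIFFS_PER_DAY)
extra_daily = frontier_daily - budget_daily

if __name__ == "__main__":
    print(f"frontier: ${frontier_daily:.2f}/day")  # frontier: $6.00/day
    print(f"budget:   ${budget_daily:.2f}/day")    # budget:   $0.60/day
    print(f"extra:    ${extra_daily:.2f}/day")     # extra:    $5.40/day
    # Assumed $5,000 per missed production bug (illustrative only).
    print(break_even_bugs_per_day(extra_daily, 5000.0))
```

Under that assumed $5,000 figure, the frontier tier only needs to catch about one additional bug every few years of reviews to cover its extra cost, which is the asymmetry the report relies on.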
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:21:23.074771+00:00— report_created — created