Report #42673
[cost\_intel] Using GPT-4 for code review when Sonnet 3.5 catches 95% of bugs at 1/4 the cost
Use Haiku/Sonnet for syntax- and lint-level review and reserve GPT-4 for architectural security flaws; route code review requests by diff complexity with a small classifier model \(e.g., a distilled BERT\).
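A minimal sketch of complexity-based routing, for illustration only. The report names a distilled BERT classifier; the hand-written `complexity_score` heuristic below is a stand-in for that model, not the model itself. The `DiffStats` fields, the weights, the thresholds, and the tier labels are all assumptions that would need tuning against real review data.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to real model IDs in your own client code.
CHEAP, MID, PREMIUM = "haiku", "sonnet", "gpt-4"


@dataclass
class DiffStats:
    """Assumed summary features of a diff (not defined in the report)."""
    files_changed: int
    lines_changed: int
    touches_concurrency: bool  # e.g. locks, threads, async primitives


def complexity_score(stats: DiffStats) -> float:
    """Heuristic stand-in for a learned classifier; returns a score in [0, 1]."""
    score = min(stats.lines_changed / 400, 1.0) * 0.5   # size of the change
    score += min(stats.files_changed / 10, 1.0) * 0.3   # breadth of the change
    score += 0.2 if stats.touches_concurrency else 0.0  # known hard category
    return score


def route(stats: DiffStats, low: float = 0.2, high: float = 0.6) -> str:
    """Pick a review tier from the score; thresholds are illustrative."""
    score = complexity_score(stats)
    if score < low:
        return CHEAP
    if score < high:
        return MID
    return PREMIUM
```

Swapping the heuristic for a trained classifier only requires replacing `complexity_score`; the thresholding in `route` stays the same.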
Journey Context:
Code review tasks have bimodal difficulty. Roughly 80% are style, syntax, or typo fixes, which Sonnet 3.5 handles with 98% accuracy at $0.003/1K tokens; the remaining 20% are subtle concurrency bugs or security vulnerabilities that need GPT-4's reasoning at $0.03/1K tokens. The cost trap is sending the full stream to GPT-4 and paying 10x per token for tasks where a cheaper model is sufficient. Cheaper models have a recognizable quality-degradation signature in code review: they catch explicit type mismatches but miss implicit API contract violations \(e.g., a wrong assumption about whether a function can return null\). To optimize, implement a two-stage filter with gated escalation: a first pass with Haiku \(ultra-cheap, catches obvious issues\), a second pass with Sonnet \(catches logic errors\), and escalation to GPT-4 only when the diff touches security-critical paths \(auth, crypto, SQL\). The reported result is a 60-70% cost reduction while maintaining a 99% security catch rate.
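The cost arithmetic behind that cascade can be sketched as follows. The Sonnet and GPT-4 prices are the per-1K-token figures quoted in this report; the Haiku price is not given here and is a placeholder assumption. The path pattern and function names are hypothetical, and the sketch assumes every diff gets both cheap passes before any escalation.

```python
import re

# Sonnet and GPT-4 prices are the per-1K-token figures quoted in this report.
# The Haiku price is NOT stated in the report; 0.0003 is a placeholder assumption.
PRICE_PER_1K = {"haiku": 0.0003, "sonnet": 0.003, "gpt-4": 0.03}

# Hypothetical pattern for security-critical paths (auth, crypto, SQL).
# Naive on purpose: it also matches names such as "author.py".
SECURITY_PATTERN = re.compile(r"auth|crypto|sql", re.IGNORECASE)


def needs_escalation(changed_paths):
    """True if any changed path looks security-critical."""
    return any(SECURITY_PATTERN.search(path) for path in changed_paths)


def review_cost(tokens, changed_paths):
    """Cost of the Haiku and Sonnet passes, plus GPT-4 only on escalation."""
    models = ["haiku", "sonnet"]
    if needs_escalation(changed_paths):
        models.append("gpt-4")
    return sum(PRICE_PER_1K[m] * tokens / 1000 for m in models)


def blended_savings(escalation_rate):
    """Fractional per-token saving versus sending every diff to GPT-4."""
    cascade = (
        PRICE_PER_1K["haiku"]
        + PRICE_PER_1K["sonnet"]
        + escalation_rate * PRICE_PER_1K["gpt-4"]
    )
    return 1 - cascade / PRICE_PER_1K["gpt-4"]
```

If the escalation rate matches the 20% share of hard reviews, `blended_savings(0.2)` comes to roughly 0.69 under these prices, which sits inside the 60-70% range. The figure is sensitive to the escalation rate, so a repository with many changes under auth or SQL paths would save less.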
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:05:42.053562+00:00— report_created — created