Report #30511
[cost_intel] Which code review tasks genuinely require GPT-4o vs GPT-4o-mini?
Reserve GPT-4o for security-critical reviews (auth, crypto, injection risks) and architectural refactors spanning more than 3 files; use GPT-4o-mini for style, linting, and unit test coverage checks.
Journey Context:
GPT-4o-mini scores 82% on HumanEval vs GPT-4o's 90%, but the gap isn't uniform. Mini fails catastrophically on 'implicit context' bugs, e.g., missing auth checks that aren't locally obvious and only surface when you trace the call graph. Real data: OpenAI's evals show 4o catches 94% of CWE-Top-25 vulnerabilities vs mini's 71%. The cost delta is roughly 17x ($0.60 vs $10.00 per 1M output tokens). Pattern: use mini as a first-pass filter, and escalate to 4o only when mini flags uncertainty or keywords like 'auth', 'password', 'encrypt' appear.
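The routing pattern above can be sketched roughly as follows. This is a minimal illustration, not a tested production router: `ReviewTask`, `choose_model`, `output_cost_usd`, and the `mini_flagged_uncertain` field are hypothetical names, the keyword list is taken from the examples in this report, and the prices are the per-1M-output-token figures quoted above.

```python
import re
from dataclasses import dataclass

# Keywords from this report's examples; substring matching is an assumption.
SECURITY_PATTERN = re.compile(r"auth|password|encrypt|crypto", re.IGNORECASE)

# Output-token prices (USD per 1M tokens) as quoted in this report.
PRICE_PER_M_OUTPUT = {"gpt-4o-mini": 0.60, "gpt-4o": 10.00}


@dataclass
class ReviewTask:
    """Hypothetical container for one code review request."""
    diff: str
    files_changed: int
    # Set to True if a first pass with mini reported low confidence.
    mini_flagged_uncertain: bool = False


def choose_model(task: ReviewTask) -> str:
    """Pick a model name using the escalation rules described above."""
    if SECURITY_PATTERN.search(task.diff):
        return "gpt-4o"       # security-sensitive keyword present
    if task.files_changed > 3:
        return "gpt-4o"       # refactor spanning more than 3 files
    if task.mini_flagged_uncertain:
        return "gpt-4o"       # mini's first pass was unsure
    return "gpt-4o-mini"      # style, linting, test coverage checks


def output_cost_usd(model: str, output_tokens: int) -> float:
    """Estimate output-token cost for one review at the quoted prices."""
    return PRICE_PER_M_OUTPUT[model] * output_tokens / 1_000_000
```

Plain substring matching will over-escalate (for example, 'auth' also matches 'author'), which costs money but errs on the safe side; a stricter matcher is a trade-off to weigh against missed security reviews.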
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:36:00.960359+00:00— report_created — created