Report #52954
[cost\_intel] GPT-4o-mini for code review with >3 files causes 40% hallucination rate vs 2% on GPT-4o, wiping out 10x cost savings
Use model routing: <3 files and <200 LOC → mini; otherwise GPT-4o. Hallucination signature: mini invents 'helper functions' that don't exist in context.
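The routing rule above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: `ReviewRequest`, `count_loc`, and `choose_model` are hypothetical names, LOC is approximated as non-blank lines, and the returned strings are only labels for the two models named in this report.

```python
from dataclasses import dataclass

# Thresholds taken from the report; both comparisons are strict ("<").
MAX_FILES_FOR_MINI = 3
MAX_LOC_FOR_MINI = 200


@dataclass
class ReviewRequest:
    """Hypothetical container: maps file path -> text sent to the model."""
    files: dict[str, str]


def count_loc(text: str) -> int:
    """Count non-blank lines; a crude stand-in for a real LOC metric."""
    return sum(1 for line in text.splitlines() if line.strip())


def choose_model(request: ReviewRequest) -> str:
    """Return the mini model only when both size limits are satisfied."""
    total_loc = sum(count_loc(body) for body in request.files.values())
    if len(request.files) < MAX_FILES_FOR_MINI and total_loc < MAX_LOC_FOR_MINI:
        return "gpt-4o-mini"
    return "gpt-4o"
```

Because both limits are strict, a review with exactly 3 files or exactly 200 lines is routed to GPT-4o.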
Journey Context:
Cost pressure pushes teams toward mini/small models. The trap is assuming that quality degrades linearly with task size. For code review \(diff analysis\), quality behaves more like a step function: a model either tracks cross-file symbol dependencies or starts inventing functions. Mini models \(GPT-4o-mini, Haiku\) reason less reliably over long inputs; once the context exceeds their 'effective working memory' \(roughly 4k-8k tokens of complex code, well below their nominal context windows\), they cite references that don't exist \(e.g., 'the validateUser function defined in auth.ts' when auth.ts wasn't provided\). GPT-4o maintains accuracy up to ~32k tokens of dense code. The cost math: at 3 files, mini is 15x cheaper and 95% accurate; at 5 files, it is still 15x cheaper but only 60% accurate, and the human review needed to catch the errors costs more than the savings. Mitigation: implement a 'complexity router' that counts LOC and file count, and use mini only when LOC < 200 and files < 3. Quality signature to monitor: extract the function names the review mentions and check them against the names found by AST-parsing the context actually sent.
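The monitoring check described above might look something like the sketch below. It rests on several assumptions that are not from the report: the context sent to the model consists of complete Python source files (Python's `ast` module cannot parse TypeScript such as auth.ts, nor raw diffs), and the review text refers to functions in the form `name(`. The helper names `defined_functions` and `hallucinated_references` are hypothetical.

```python
import ast
import re


def defined_functions(sources: dict[str, str]) -> set[str]:
    """Collect function and method names defined in the sources actually sent.

    Assumes every value is a complete, syntactically valid Python file.
    """
    names: set[str] = set()
    for text in sources.values():
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                names.add(node.name)
    return names


# Heuristic (an assumption): an identifier immediately followed by "(" in the
# review text is treated as a function reference.
_CALL_REF = re.compile(r"\b([A-Za-z_]\w*)\(")


def hallucinated_references(review: str, sources: dict[str, str]) -> set[str]:
    """Return function names the review mentions that the sources do not define."""
    mentioned = set(_CALL_REF.findall(review))
    return mentioned - defined_functions(sources)
```

A non-empty result is only a signal worth logging, not proof of hallucination: built-ins and imported library functions are also absent from the sent files, so a real deployment would likely need an allowlist.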
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:22:36.770468+00:00— report_created — created