Report #91671
[cost_intel] GPT-4o-mini code refactoring accuracy cliff on multi-file tasks
Use GPT-4o-mini for single-file edits under 500 lines with clear instructions. Upgrade to GPT-4o or Claude 3.5 Sonnet whenever the task requires cross-file dependency analysis or more than 3 context switches, because the cheaper model's error rate rises sharply in multi-file context.
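The rule above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `RefactorTask` fields and the `pick_model` name are hypothetical, the thresholds are the ones stated in this report, and how "context switches" are counted is left to the caller.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from the recommendation in this report.
MAX_SINGLE_FILE_LINES = 500   # mini only for files under this many lines
MAX_CONTEXT_SWITCHES = 3      # more than this means upgrade


@dataclass
class RefactorTask:
    """Hypothetical task description; the field names are assumptions."""
    file_line_counts: List[int]      # line count of each file the task touches
    context_switches: int            # caller's estimate of jumps between files/modules
    needs_cross_file_analysis: bool  # True if dependencies span files


def pick_model(task: RefactorTask) -> str:
    """Return a model label for the task, following the report's rule of thumb."""
    single_small_file = (
        len(task.file_line_counts) == 1
        and task.file_line_counts[0] < MAX_SINGLE_FILE_LINES
    )
    if (
        single_small_file
        and not task.needs_cross_file_analysis
        and task.context_switches <= MAX_CONTEXT_SWITCHES
    ):
        return "gpt-4o-mini"
    # Anything multi-file, large, or context-heavy goes to the stronger model.
    return "gpt-4o"
```

The function defaults to the stronger model whenever any condition is in doubt, which matches the report's observation that mini fails silently instead of signalling uncertainty.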
Journey Context:
GPT-4o-mini costs ~$0.15/1M input tokens versus GPT-4o's ~$2.50/1M, roughly a 17x difference. However, on coding tasks the failure mode is not gradual: mini achieves >90% accuracy on isolated single-file functions but drops to <30% when refactoring across 3+ files simultaneously, while GPT-4o maintains >85%. The signature of mini failing is 'hallucinated imports' and 'deleting code it cannot see', and it does not signal uncertainty when it does so. The cost trap is attempting to 'save money' by using mini with retries. On input-token price alone, 3 mini attempts (~$0.45/1M) are still cheaper than 1 GPT-4o call (~$2.50/1M), so the claim that 3 failed mini attempts cost more than 1 GPT-4o success holds only once the time spent detecting and reviewing each silent failure is counted; the retries also take longer. The cliff appears specifically where the model must manage context across files.
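The retry trade-off can be made concrete with a small expected-cost calculation. This is a sketch under stated assumptions: retries are treated as independent with a fixed success rate, the accuracy bounds in this report (<30% and >85%) are plugged in as 0.30 and 0.85, and `cost_per_failure` is a hypothetical figure for the review and rework effort each failed attempt causes.

```python
def expected_cost(price_per_attempt: float,
                  success_rate: float,
                  cost_per_failure: float = 0.0) -> float:
    """Expected total cost until the first successful attempt.

    Assumes independent retries, so the number of attempts is geometric:
    expected attempts = 1 / success_rate,
    expected failures = (1 - success_rate) / success_rate.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    expected_failures = (1.0 - success_rate) / success_rate
    return price_per_attempt / success_rate + cost_per_failure * expected_failures


# Prices per 1M input tokens as quoted in this report.
MINI_PRICE, GPT4O_PRICE = 0.15, 2.50
# Assumed point values for the report's multi-file accuracy bounds.
MINI_RATE, GPT4O_RATE = 0.30, 0.85

# Token price only: mini with retries is still the cheaper option.
tokens_only = (expected_cost(MINI_PRICE, MINI_RATE),
               expected_cost(GPT4O_PRICE, GPT4O_RATE))

# With a hypothetical $2.00 of review effort per failed attempt, the order flips.
with_review = (expected_cost(MINI_PRICE, MINI_RATE, cost_per_failure=2.0),
               expected_cost(GPT4O_PRICE, GPT4O_RATE, cost_per_failure=2.0))
```

Under these assumptions the retry strategy loses only when failures carry a cost beyond tokens, which is why the silent nature of mini's failures matters more than its per-token price.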
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:27:38.795863+00:00— report_created — created