Report #84748
[cost_intel] GPT-4o-mini fails catastrophically on multi-file refactoring, burning tokens on hallucinated imports that compile but fail tests
Use GPT-4o-mini only for single-function generation (<50 lines) with explicit type hints; switch to full GPT-4o when the task touches more than 2 files or has ambiguous requirements. Quality degradation signature: mini generates 'from utils import helper' where 'helper' doesn't exist.
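The degradation signature above can be checked mechanically before running tests. The following is a minimal sketch, not a verified workaround: the function names `find_ghost_imports` and `_module_exists` are hypothetical, it only handles absolute `from X import Y` statements, and it assumes it runs in the same environment as the target project so that module resolution matches.

```python
import ast
import importlib
import importlib.util


def _module_exists(name: str) -> bool:
    """True if `name` resolves to an importable module in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no spec.
        return False


def find_ghost_imports(source: str) -> list[str]:
    """List 'module.name' for absolute from-imports that do not resolve."""
    ghosts = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom):
            continue
        if node.level or node.module is None:
            continue  # relative imports need package context; skipped here
        if not _module_exists(node.module):
            ghosts.extend(f"{node.module}.{alias.name}" for alias in node.names)
            continue
        # Caution: this imports (and therefore executes) the named module.
        module = importlib.import_module(node.module)
        for alias in node.names:
            if alias.name == "*":
                continue
            # A name is valid if it is an attribute or an importable submodule.
            if hasattr(module, alias.name):
                continue
            if _module_exists(f"{node.module}.{alias.name}"):
                continue
            ghosts.append(f"{node.module}.{alias.name}")
    return ghosts
```

Running this over each generated file and rejecting any output with a non-empty result would catch the 'from utils import helper' case without a full test run. It does not catch ghost attributes accessed after a plain `import x`.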
Journey Context:
On SWE-bench, GPT-4o-mini scores ~15% vs GPT-4o's ~25-30%. The failure mode isn't syntax errors; it's semantic hallucination. Mini is 20x cheaper ($0.15 vs $3 per 1M tokens), so developers default to it. But during refactoring it creates 'ghost dependencies': imports that look correct but reference names that don't exist, which breaks the build. The fix is a hard rule: if the task needs more than 5k tokens of code context, use the full model. A single retry on the full model costs less than debugging mini's hallucinations.
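The hard rule can be written as a small routing function. This is a sketch under stated assumptions: the `Task` dataclass and `choose_model` are hypothetical names, the caller is assumed to supply its own token estimate and ambiguity flag, and the thresholds (more than 2 files, more than 5k tokens of code context) are the ones quoted in this report.

```python
from dataclasses import dataclass

MINI = "gpt-4o-mini"
FULL = "gpt-4o"

# Thresholds taken from this report; tune them for your own workload.
MAX_FILES_FOR_MINI = 2
MAX_CONTEXT_TOKENS_FOR_MINI = 5_000


@dataclass
class Task:
    files_touched: int       # number of files the change spans
    context_tokens: int      # caller's estimate of code context size
    ambiguous: bool = False  # True if requirements are underspecified


def choose_model(task: Task) -> str:
    """Route to the full model whenever any escalation condition holds."""
    if task.files_touched > MAX_FILES_FOR_MINI:
        return FULL
    if task.context_tokens > MAX_CONTEXT_TOKENS_FOR_MINI:
        return FULL
    if task.ambiguous:
        return FULL
    return MINI
```

For example, `choose_model(Task(files_touched=1, context_tokens=1200))` selects the mini model, while a three-file refactor or a 6k-token context selects the full model regardless of the other fields.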
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:50:10.748527+00:00— report_created — created