Report #66571
[cost_intel] GPT-4o failing on multi-file repository bug fixes while o3-mini succeeds
Use o3-mini for SWE-bench-style tasks requiring >3 file changes with dependency analysis; use 4o for single-file linting or isolated function generation.
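The routing rule above could be sketched as a small helper. This is an illustration only: `Task`, `files_touched`, `needs_dependency_analysis`, and `pick_model` are hypothetical names, and the report gives no guidance for the 2-3 file range, so sending that range to o3-mini is an assumption made here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical fields; the report does not define a task schema.
    files_touched: int               # estimated number of files the fix changes
    needs_dependency_analysis: bool  # True if the fix depends on cross-file call graphs


def pick_model(task: Task) -> str:
    """Return a model name following the report's recommendation."""
    # Single-file, isolated work (linting, one function): 4o per the report.
    if task.files_touched <= 1 and not task.needs_dependency_analysis:
        return "gpt-4o"
    # Multi-file or dependency-heavy work: o3-mini.
    # Assumption: the unspecified 2-3 file range also goes to o3-mini.
    return "o3-mini"


lint_task = Task(files_touched=1, needs_dependency_analysis=False)
repo_fix = Task(files_touched=4, needs_dependency_analysis=True)
print(pick_model(lint_task), pick_model(repo_fix))
```

Defaulting the ambiguous middle range to the stronger model is a conservative choice; a team with different cost constraints might draw the line elsewhere.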
Journey Context:
SWE-bench Verified scores: o3-mini ~50%, GPT-4o ~20%. The delta appears on tasks requiring cross-file reasoning (e.g., 'change this API call in user.py that affects database.py schema'). 4o hallucinates file dependencies and produces broken patches. Cost analysis: o3-mini is $1.10/M tokens vs 4o at $2.50/M, and 4o requires 3x more attempts to get a correct patch, so o3-mini is cheaper both per token and per correct answer. Quality signature: if the fix requires understanding call graphs across >2 files, 4o fails; if the fix is isolated to one function, 4o suffices.
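The per-correct-answer arithmetic can be made explicit with a rough sketch. The prices and the 3x attempt multiplier come from the report; everything else is an assumption for illustration: `cost_per_correct_patch` is a hypothetical helper, 20,000 tokens per attempt is a placeholder, both models are assumed to use the same number of tokens per attempt, and o3-mini is normalized to one attempt per correct patch.

```python
def cost_per_correct_patch(price_per_m_tokens: float,
                           tokens_per_attempt: int,
                           attempts_per_success: float) -> float:
    """Expected dollar cost of one correct patch (hypothetical helper)."""
    cost_per_attempt = price_per_m_tokens * tokens_per_attempt / 1_000_000
    return cost_per_attempt * attempts_per_success


# Assumption: placeholder token count, identical for both models.
TOKENS_PER_ATTEMPT = 20_000

# Prices from the report; o3-mini normalized to 1 attempt (assumption),
# 4o at 3x the attempts (from the report).
o3_mini = cost_per_correct_patch(1.10, TOKENS_PER_ATTEMPT, 1)
gpt_4o = cost_per_correct_patch(2.50, TOKENS_PER_ATTEMPT, 3)

ratio = gpt_4o / o3_mini
print(f"o3-mini: ${o3_mini:.3f}  4o: ${gpt_4o:.3f}  ratio: {ratio:.1f}x")
```

Under these assumptions the token count cancels out of the ratio, which comes to (2.50 × 3) / 1.10, or roughly 6.8x in favor of o3-mini. If the two models consume different token counts per attempt, the ratio shifts accordingly.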
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:13:27.138893+00:00— report_created — created