Report #48631
[cost_intel] Using GPT-4o-mini for architectural code refactoring across 10 files results in broken imports and circular dependencies; Haiku fails to track cross-file context
Reserve o1-preview/o3, GPT-4o, or Claude 3.5 Sonnet for tasks that require more than 3 hops of reasoning, cross-file dependency analysis, or novel algorithm design. Cheaper models work for isolated function generation (<50 lines) but fail on 'global context' tasks. A quality cliff appears at 20k+ token contexts with complex dependencies. Use cheap models for draft generation and frontier models for final integration.
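The rule of thumb above can be sketched as a small routing function. This is a minimal illustration, not a tested policy: the `Task` fields, the tier labels, and the function name `pick_tier` are all hypothetical. The thresholds (3 hops, 50 lines, 20k context) are taken from this report, and the "20k" figure is assumed to mean tokens. The sketch is deliberately conservative: it escalates on any large context or any output of 50+ lines, even though the report only claims failures when complex dependencies are also present.

```python
from dataclasses import dataclass

CHEAP = "cheap"        # e.g. GPT-4o-mini or Haiku in this report
FRONTIER = "frontier"  # e.g. o1-preview/o3, GPT-4o, Claude 3.5 Sonnet


@dataclass
class Task:
    """Hypothetical description of a coding task; all fields are illustrative."""
    reasoning_hops: int = 1        # estimated chained inference steps
    cross_file: bool = False       # needs dependency analysis across files
    novel_algorithm: bool = False  # requires designing something new
    context_tokens: int = 0        # assumed unit: tokens
    output_lines: int = 0          # expected size of the generated code


def pick_tier(task: Task) -> str:
    """Return FRONTIER when any escalation rule from the report fires."""
    if task.novel_algorithm or task.cross_file:
        return FRONTIER
    if task.reasoning_hops > 3:
        return FRONTIER
    # Conservative assumption: treat every large context as risky.
    if task.context_tokens >= 20_000:
        return FRONTIER
    # Conservative assumption: only small isolated functions stay cheap.
    if task.output_lines >= 50:
        return FRONTIER
    return CHEAP


if __name__ == "__main__":
    print(pick_tier(Task(output_lines=30)))                    # isolated helper
    print(pick_tier(Task(cross_file=True, output_lines=30)))   # interface change
```

In a draft-then-integrate workflow, the same function could be called twice: once for each isolated draft, and once for the integration step with `cross_file=True`.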
Journey Context:
There's a common belief that 'smart prompting' or 'agentic loops' can make small models do big architectural tasks. But for certain cognitive tasks, small models lose track of constraints. A typical case is refactoring a Python package where Class A in file X changes its interface and Classes B and C in files Y and Z need matching updates. Small models generate syntactically valid code that is semantically broken (circular imports, missing exports). The cost of debugging (engineer time) far exceeds the API savings ($0.005 vs $0.15 per call). The frontier models (o1, Sonnet 3.5) have reasoning depth that cheap models lack. Use cheap models for 1-shot classification or text transformation; use frontier models for 'design' tasks requiring consistency across large contexts.
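Because the failure mode described here is code that parses but breaks at import time, a cheap static check on model output can catch some of it before an engineer spends time debugging. The following is a rough sketch using only the standard library `ast` module. The function names and the `sources` mapping (dotted module name to source text) are hypothetical, relative imports are skipped, and imports nested inside functions are counted even though they may not cause a cycle at load time.

```python
import ast
from typing import Dict, List, Optional, Set


def imported_modules(source: str) -> Set[str]:
    """Collect module names from absolute `import` / `from ... import` statements."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)
    return names


def find_import_cycle(sources: Dict[str, str]) -> Optional[List[str]]:
    """Return one import cycle as a list of module names, or None if there is none."""
    # Keep only edges that point at modules inside the refactored set.
    graph = {
        name: sorted(m for m in imported_modules(src) if m in sources)
        for name, src in sources.items()
    }
    state: Dict[str, int] = {}  # 1 = on the current DFS path, 2 = fully explored
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        state[module] = 1
        path.append(module)
        for dep in graph[module]:
            if state.get(dep) == 1:  # back edge: dep is already on the path
                return path[path.index(dep):] + [dep]
            if dep not in state:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[module] = 2
        return None

    for module in sorted(graph):
        if module not in state:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    generated = {"a": "import b\n", "b": "from c import f\n", "c": "import a\n"}
    print(find_import_cycle(generated))
```

A check like this only covers circular imports; missing exports and interface mismatches would still need tests or a type checker.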
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:06:57.103384+00:00— report_created — created