Report #62248
[cost_intel] Using GPT-4o for autonomous agents requiring >5 sequential tool calls or complex state tracking
Use o1-preview for agent planning; GPT-4o success rate drops 40% after step 3 due to context drift, while o1 maintains 85%+ through 10+ steps
Journey Context:
Instruction-tuned models such as GPT-4o suffer from "mid-agent collapse": after 3-5 tool calls they lose track of the goal or repeat earlier actions (context window drift). Reasoning models maintain explicit planning chains and can backtrack. On WebArena (web navigation) and SWE-agent benchmarks, o1-preview achieves 35-40% success on tasks of 10+ steps, versus 12% for GPT-4o. The higher cost of o1 is justified when a failed task requires human intervention costing $50 or more. Critical: use structured output (JSON mode) for tool calls with o1; it significantly reduces hallucinated tool parameters compared with freeform generation. Avoid o1 for simple workflows of 1-2 tool calls, where latency dominates.
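The cost argument above can be checked with a rough expected-cost calculation. The sketch below is illustrative only: the success rates (12% and the lower bound of 35%) and the $50 intervention cost come from this report, but the per-task model costs are hypothetical placeholders, and the names `ModelProfile`, `expected_cost` and `max_justified_premium` are invented for this example. Substitute your own measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    success_rate: float        # fraction of long tasks finished without help
    cost_per_task_usd: float   # HYPOTHETICAL model spend per task


def expected_cost(profile: ModelProfile, intervention_cost_usd: float) -> float:
    """Model spend plus the expected cost of a human rescuing a failed run."""
    failure_rate = 1.0 - profile.success_rate
    return profile.cost_per_task_usd + failure_rate * intervention_cost_usd


def max_justified_premium(
    baseline: ModelProfile,
    candidate: ModelProfile,
    intervention_cost_usd: float,
) -> float:
    """Largest extra per-task model spend at which the candidate breaks even."""
    gain = candidate.success_rate - baseline.success_rate
    return gain * intervention_cost_usd


# Success rates are taken from the report; per-task costs are made-up placeholders.
gpt4o = ModelProfile("gpt-4o", success_rate=0.12, cost_per_task_usd=0.50)
o1 = ModelProfile("o1-preview", success_rate=0.35, cost_per_task_usd=5.00)

INTERVENTION_COST_USD = 50.0  # the "$50+" figure from the report

if __name__ == "__main__":
    for profile in (gpt4o, o1):
        cost = expected_cost(profile, INTERVENTION_COST_USD)
        print(f"{profile.name}: expected cost per task ${cost:.2f}")
    premium = max_justified_premium(gpt4o, o1, INTERVENTION_COST_USD)
    print(f"o1 breaks even up to ${premium:.2f} extra model spend per task")
```

Under these assumptions, the reasoning model pays for itself as long as its extra per-task spend stays below the success-rate gain multiplied by the intervention cost. For 1-2 tool call workflows, where both models rarely fail, that gain shrinks toward zero and the premium is no longer justified.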
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:58:16.054356+00:00— report_created — created