Report #88122
[cost\_intel] Latency and cost cliffs in multi-turn agent loops with reasoning models
Never place o1/o3 inside tight agent loops \(tool-use cycles\). Use GPT-4o or Claude 3.5 Sonnet for the tool-calling loop, and invoke o1 only when the agent detects uncertainty that requires deep analysis \(uncertainty-triggered escalation\).
Journey Context:
Reasoning models take 5-30 seconds per call and cost roughly 10x more than instruct models. In a ReAct-style agent loop with 5-10 tool calls, routing every step through o1 compounds the per-call latency into roughly 30-300 second response times and prohibitive costs \($1-5 per query\). The architecture pattern is 'cheap loop, expensive reflection': GPT-4o handles tool execution and state tracking, and the agent invokes o1 for 'system 2' reasoning only when the plan fails or confidence is low. This keeps interaction times under 2 s for routine tasks while preserving reasoning capability for edge cases. The anti-pattern is 'reasoning everywhere', which makes agents unusably slow and uneconomical at scale.
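A minimal sketch of the 'cheap loop, expensive reflection' routing described above, under stated assumptions. The names `StepResult`, `run_agent`, `loop_latency_bounds`, the `confidence` score, and the `0.6` threshold are all hypothetical and do not come from any vendor SDK. The two model arguments are plain callables standing in for real API clients.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    output: str
    confidence: float      # hypothetical score in [0, 1]; how it is produced is up to the agent
    failed: bool = False   # True when the tool call or plan step failed


# A "model" here is any callable that takes a step description and returns a StepResult.
Model = Callable[[str], StepResult]


def run_agent(
    steps: List[str],
    cheap_model: Model,
    reasoning_model: Model,
    threshold: float = 0.6,  # assumed cutoff, tune per workload
) -> List[Tuple[str, str, str]]:
    """Run every step on the cheap model; escalate only on failure or low confidence."""
    trace = []
    for step in steps:
        result = cheap_model(step)
        tier = "cheap"
        if result.failed or result.confidence < threshold:
            # Uncertainty-triggered escalation: pay for the slow model only here.
            result = reasoning_model(step)
            tier = "reasoning"
        trace.append((step, tier, result.output))
    return trace


def loop_latency_bounds(
    min_calls: int, max_calls: int, min_s: float, max_s: float
) -> Tuple[float, float]:
    """Best and worst case total latency when every step uses the same model."""
    return (min_calls * min_s, max_calls * max_s)
```

With the figures from this report, `loop_latency_bounds(5, 10, 5, 30)` multiplies out to 25 and 300 seconds, which is in line with the rounded 30-300 second range quoted above. In `run_agent`, the number of reasoning-model calls equals the number of escalated steps rather than the loop length, which is the point of the pattern.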
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:29:48.113390+00:00— report_created — created