Report #96929
[cost_intel] When does multi-step tool use justify reasoning model costs?
Use o1/o3 for agent workflows requiring >3 sequential tool calls with error recovery; use GPT-4o with deterministic state machines for workflows of 3 or fewer steps.
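The "cheap model plus deterministic state machine" side of this recommendation could look like the sketch below. It is illustrative only: `search` and `summarize` are hypothetical callables (each might wrap one GPT-4o call or one tool call), and the state names and retry limit are assumptions, not part of the report. The point it shows is that code, not the model, decides which step runs next.

```python
from enum import Enum, auto
from typing import Callable, Optional, Tuple


class State(Enum):
    SEARCH = auto()
    SUMMARIZE = auto()
    DONE = auto()
    FAILED = auto()


# A step takes the current payload and returns a new payload, or None on failure.
Step = Callable[[str], Optional[str]]


def run_workflow(
    query: str,
    search: Step,      # hypothetical: e.g. one model or tool call
    summarize: Step,   # hypothetical: e.g. one model or tool call
    max_retries: int = 2,
) -> Tuple[State, Optional[str]]:
    """Drive a fixed two-step sequence with bounded retries per step."""
    state, payload, retries = State.SEARCH, query, 0
    while state not in (State.DONE, State.FAILED):
        step = search if state is State.SEARCH else summarize
        result = step(payload)
        if result is None:
            # Step failed: retry the same step until the limit is exceeded.
            retries += 1
            if retries > max_retries:
                state = State.FAILED
            continue
        # Step succeeded: follow the hard-coded transition.
        payload, retries = result, 0
        state = State.SUMMARIZE if state is State.SEARCH else State.DONE
    return (state, payload if state is State.DONE else None)
```

Because the transitions are fixed, a failure stays local to one step and is retried there, rather than letting the model re-plan the whole sequence.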
Journey Context:
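On AgentBench multi-step tool-use tasks, per-step errors compound as sequences get longer: GPT-4o achieves 75% success on 1-step tasks, 45% on 3-step tasks, and 12% on 5-step tasks. o1 maintains 70% success at 5 steps due to explicit planning and backtracking. Cost analysis: a 5-step GPT-4o retry loop averages $0.12 at 12% success, which works out to about $1.00 per successful task ($0.12 / 0.12); o1 costs $0.80 at 70% success, or about $1.14 per successful task ($0.80 / 0.70). Raw API spend per success is therefore close, and what tips the balance is the cost of the failures themselves. Assuming independent retries until success, the break-even on these figures is roughly $0.02 per failed attempt, so reasoning wins on cost-per-success when step count is >3 and failures are not free, and wins clearly when failure costs are high (e.g., booking non-refundable flights). The signature: deterministic short sequences = cheap model + code; ambiguous long sequences with error recovery = reasoning. Common mistake: using GPT-4o for 10-step research tasks that chain web search → calculator → code execution, resulting in cascading hallucinations.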
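The cost-per-success comparison above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each attempt is independent and the workflow retries until it succeeds. The per-attempt costs and success rates are the figures quoted in this report; `failure_cost` (the extra cost of one failed attempt, such as cleanup or a lost booking) is a hypothetical parameter added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    cost_per_attempt: float  # USD per 5-step attempt, as quoted in the report
    success_rate: float      # probability that a single attempt succeeds


def cost_per_success(opt: Option, failure_cost: float = 0.0) -> float:
    """Expected spend per successful task when retrying until success.

    With independent attempts, the expected number of attempts is
    1 / success_rate, and all but the last of them are failures.
    """
    attempts = 1.0 / opt.success_rate
    failures = attempts - 1.0
    return attempts * opt.cost_per_attempt + failures * failure_cost


def break_even_failure_cost(cheap: Option, strong: Option) -> float:
    """Failure cost at which both options cost the same per success.

    Solves (c1 + (1 - p1) * f) / p1 == (c2 + (1 - p2) * f) / p2 for f.
    """
    numerator = (cheap.success_rate * strong.cost_per_attempt
                 - strong.success_rate * cheap.cost_per_attempt)
    return numerator / (strong.success_rate - cheap.success_rate)


GPT4O = Option("GPT-4o", cost_per_attempt=0.12, success_rate=0.12)
O1 = Option("o1", cost_per_attempt=0.80, success_rate=0.70)

if __name__ == "__main__":
    for opt in (GPT4O, O1):
        print(f"{opt.name}: ${cost_per_success(opt):.2f} per success")
    print(f"break-even failure cost: ${break_even_failure_cost(GPT4O, O1):.3f}")
```

With the report's figures, raw API cost per success is about $1.00 for GPT-4o and about $1.14 for o1, and the break-even failure cost is about $0.02. Under these assumptions, the reasoning model is cheaper per success as soon as a failed attempt costs more than a couple of cents.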
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:16:47.528581+00:00— report_created — created