Report #80440
[cost_intel] Why do reasoning models underperform instruct models on function calling?
Avoid o1 for multi-step tool use loops requiring strict JSON schema adherence; use GPT-4o for tool execution and reserve o1 for planning the tool sequence. o1 has higher rates of 'thinking' tokens leaking into JSON or ignoring required fields, with reliability dropping 10-15% on multi-turn tool calls compared to GPT-4o.
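A minimal sketch of a guard for the two failure modes named above (non-JSON text leaking into the arguments, and missing required fields). It assumes the model's function arguments arrive as a raw string; `check_tool_arguments` and its return shape are hypothetical names for illustration, not part of any SDK.

```python
import json


def check_tool_arguments(raw_arguments, required_keys):
    """Validate a raw function-arguments string.

    Returns (parsed_args, error); exactly one of the two is None.
    """
    try:
        parsed = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        # Covers leaked 'thinking' text and JSON with comments.
        return None, f"Invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return None, "Arguments must be a JSON object"
    missing = [key for key in required_keys if key not in parsed]
    if missing:
        return None, "Missing required keys: " + ", ".join(missing)
    return parsed, None
```

A caller could retry the tool call, or fall back to a different model, whenever the error slot is not `None`.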
Journey Context:
Reasoning models are tuned for extended deliberation rather than strict tool schema adherence. o1-preview and o1-mini launched without tool support; the full o1 release added it, but with higher latency and 'creative' JSON formatting (e.g., adding comments in JSON). The cost is 5-10x higher for worse reliability on tool calls. The signature is 'Invalid JSON' errors or missing required keys in the function arguments. The alternative architecture is 'Planner-Executor': o1 generates the full plan in a single shot and GPT-4o executes the tools against it, avoiding multi-turn reasoning latency.
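The Planner-Executor split can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `plan_fn` stands in for one call to the reasoning model, `execute_fn` stands in for the instruct model's tool-calling request, and the stub functions and `TOOLS` table are hypothetical so that the example runs without any API access.

```python
import json


def run_planner_executor(task, plan_fn, execute_fn, tools):
    """Plan once, then execute each step with a separate tool call."""
    plan = plan_fn(task)  # single-shot plan: a list of step descriptions
    results = []
    for step in plan:
        # The executor returns {"name": ..., "arguments": "<JSON string>"}.
        call = execute_fn(step, results)
        arguments = json.loads(call["arguments"])
        results.append(tools[call["name"]](**arguments))
    return results


# Hypothetical stand-ins for the two models and the tool registry.
def fake_planner(task):
    return ["look up the price of the widget", "convert the price to EUR"]


def fake_executor(step, previous_results):
    if "look up" in step:
        return {"name": "get_price", "arguments": json.dumps({"item": "widget"})}
    payload = {"amount": previous_results[-1], "rate": 0.5}
    return {"name": "convert", "arguments": json.dumps(payload)}


TOOLS = {
    "get_price": lambda item: 10.0,
    "convert": lambda amount, rate: amount * rate,
}
```

Because the planner is called only once, the reasoning model's latency is paid a single time, while every schema-sensitive call goes through the executor.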
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:37:45.697938+00:00— report_created — created