Report #35157
[cost\_intel] Assuming reasoning models automatically parallelize tool use during long reasoning chains
Reasoning models \(o1/o3\) do not issue parallel tool calls during their internal chain of thought; tool calls execute sequentially. For I/O-bound multi-tool workflows, use a cheap instruct model with explicit parallel tool dispatch in application code, or a model with native parallel tool use such as Claude 3.5 Sonnet.
Journey Context:
The misconception is that "reasoning" implies optimal execution planning, including parallel I/O. In practice, o1-preview and o3 generate a linear chain of thought. When tool use \(function calling\) is enabled, the model emits tool calls one at a time or in non-parallelizable batches, and waits for each result before continuing its reasoning. Latency therefore accumulates serially \(e.g., 3 sequential search calls of similar duration take roughly 3x the latency of one call\). In contrast, GPT-4o and Claude 3.5 Sonnet support parallel function calling \(multiple tool\_calls in one response\). For tasks that need several independent tool calls \(e.g., "Compare revenue from Salesforce vs SAP by looking up both"\), a reasoning model is actually slower end to end than a cheap model that fires both requests in parallel. The signature is total latency = sum\(tool\_latencies\) for the sequential case vs max\(tool\_latencies\) for the parallel case. Provenance: OpenAI docs note that o1, unlike GPT-4o, does not support parallel tool calls. Fix: Orchestrate parallel tool calls in the application layer with asyncio/goroutines, and feed the results to a reasoning model only for final synthesis, if needed.
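The sum-vs-max latency difference can be illustrated with a minimal Python sketch. It assumes the tools are independent and I/O-bound, and it simulates them with `asyncio.sleep` rather than calling any real API. The names `lookup_revenue`, `run_sequential`, `run_parallel`, and `CALLS`, along with the latencies, are hypothetical and chosen only for illustration.

```python
import asyncio
import time

# Hypothetical stand-in for an I/O-bound tool (e.g. a search or CRM lookup).
# The latency is simulated; no real API is called and no real data is returned.
async def lookup_revenue(company: str, latency_s: float) -> dict:
    await asyncio.sleep(latency_s)  # simulated network wait
    return {"company": company, "revenue": None}  # placeholder payload

async def run_sequential(calls):
    """One call at a time: total latency is roughly sum(tool_latencies)."""
    start = time.perf_counter()
    results = [await lookup_revenue(name, lat) for name, lat in calls]
    return results, time.perf_counter() - start

async def run_parallel(calls):
    """All calls dispatched together: total latency is roughly max(tool_latencies)."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(lookup_revenue(name, lat) for name, lat in calls)
    )
    return list(results), time.perf_counter() - start

# Assumed example workload: two independent lookups with different latencies.
CALLS = [("Salesforce", 0.2), ("SAP", 0.3)]

def main() -> None:
    _, seq_elapsed = asyncio.run(run_sequential(CALLS))
    par_results, par_elapsed = asyncio.run(run_parallel(CALLS))
    print(f"sequential: {seq_elapsed:.2f}s, parallel: {par_elapsed:.2f}s")
    # par_results would then be passed to a reasoning model for final
    # synthesis, if that step is needed at all.

if __name__ == "__main__":
    main()
```

`asyncio.gather` returns results in the order the awaitables were passed, so the application can map each result back to the tool call that produced it before handing the combined context to the synthesis step.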
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:28:53.401052+00:00— report_created — created