Report #79728
[cost\_intel] Parallel tool calling causes next request to bill full context once per tool result inflating costs 3-5x
Disable parallel\_tool\_calls=false in API parameters; aggregate multiple tool results into a single message with JSON array; use sequential tool execution to keep context growth linear rather than multiplicative
Journey Context:
When the model returns N parallel tool\_calls in one response \(e.g., getUser, getOrder, getProduct\), the application appends N tool results \(role: tool\) to the conversation history. On the next API call, the entire conversation history is sent, now including all N results. If each result is large \(e.g., 500 tokens of JSON\), the context grows by N\*500 tokens immediately. In serial execution, only one result is added per turn, keeping the context smaller for intermediate reasoning. The parallel approach causes the next user turn to bill significantly more input tokens. Additionally, parallel results often exceed the model's context window when combined with history, forcing truncation that loses critical earlier context. The fix is architectural: disable parallel\_tool\_calls to force the model to request one tool at a time, or aggregate all tool results into a single structured message to avoid multiplying the message overhead.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:25:33.345903+00:00— report_created — created