Report #27386
[cost\_intel] Parallel tool calling causes rapid context bloat by permanently embedding N large tool results
Disable parallel tool calling \(parallel\_tool\_calls: false\) when tools return large payloads; implement result summarization middleware that stores full API responses in external storage and returns only condensed summaries \(≤200 tokens\) to the LLM; if parallel calls stay enabled, limit them to 2-3 per turn.
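A minimal sketch of the summarization middleware described above, not a definitive implementation. The helper names, the in-memory dict standing in for external storage, the summary layout, and the 4-characters-per-token heuristic are all assumptions made for illustration; a real deployment would use an actual store and a real tokenizer.

```python
import hashlib
import json

# Hypothetical stand-in for external storage (object store, database, cache).
RESULT_STORE: dict[str, str] = {}

MAX_SUMMARY_TOKENS = 200  # summary budget taken from the report
CHARS_PER_TOKEN = 4       # rough heuristic (assumption), not a real tokenizer


def store_and_summarize(tool_name: str, payload: object) -> str:
    """Store the full tool result externally and return a condensed summary.

    The returned string is what would go into the tool message instead of
    the full payload.
    """
    full = json.dumps(payload, sort_keys=True, default=str)
    ref = hashlib.sha256(full.encode("utf-8")).hexdigest()[:16]
    RESULT_STORE[ref] = full

    max_chars = MAX_SUMMARY_TOKENS * CHARS_PER_TOKEN
    header = f"[{tool_name}] ref={ref} full_chars={len(full)} preview: "
    budget = max(0, max_chars - len(header))
    # Clip the whole summary so it never exceeds the character budget.
    return (header + full[:budget])[:max_chars]


def load_full_result(ref: str) -> object:
    """Fetch the full stored result when the application actually needs it."""
    return json.loads(RESULT_STORE[ref])


if __name__ == "__main__":
    rows = {"rows": [{"id": i, "name": "x" * 50} for i in range(500)]}
    summary = store_and_summarize("db_query", rows)
    print(len(json.dumps(rows)), "chars stored;", len(summary), "chars sent to the LLM")
```

A plain truncated preview is the simplest possible summary; a tool-specific summary \(row counts, key fields, error status\) would usually be more useful to the model within the same budget.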
Journey Context:
OpenAI's parallel function calling lets the model request multiple tools in a single turn, which reduces latency. However, each tool result must be added as a separate \`tool\` message in the conversation history. When tools return large JSON payloads \(e.g., database query results, API responses with nested objects\), those results stay in the context window for the rest of the conversation. With parallel calls, a single turn can add 3 x 10k tokens of tool results. On every subsequent turn, those 30k tokens are billed again as input, on top of any new content. The context therefore grows linearly with each turn, and cumulative input cost grows roughly quadratically with the number of turns, so conversations with parallel tools quickly hit context limits and become prohibitively expensive. The trap is that 'parallel' looks efficient for latency but is costly with real-world tool payloads. The fix is to disable parallel calls when results are large, or better, to implement middleware that stores full tool results externally and returns only essential summaries to the LLM, keeping the conversation history lean.
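The re-billing effect can be made concrete with a little arithmetic. The sketch below assumes a 1,000-token base prompt, 10 turns, and a fixed number of tool-result tokens appended per turn; it ignores output tokens and any prompt caching discounts, so the numbers are illustrative only.

```python
def cumulative_input_tokens(turns: int, added_per_turn: int, base: int = 1_000) -> int:
    """Total input tokens billed over a conversation.

    Each turn bills the whole history as input, then appends
    `added_per_turn` tokens of tool results that stay in the history.
    """
    total = 0
    context = base
    for _ in range(turns):
        total += context           # entire history is billed on this turn
        context += added_per_turn  # tool results are kept for later turns
    return total


if __name__ == "__main__":
    raw = cumulative_input_tokens(turns=10, added_per_turn=30_000)   # 3 x 10k raw results
    lean = cumulative_input_tokens(turns=10, added_per_turn=600)     # 3 x 200-token summaries
    print(f"raw payloads: {raw:,} input tokens")    # 1,360,000
    print(f"summaries:    {lean:,} input tokens")   # 37,000
```

Under these assumptions the closed form is turns x base + added x turns x \(turns - 1\) / 2, which is why the total scales with the square of the conversation length rather than exponentially.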
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:21:37.173129+00:00— report_created — created