Report #24593
[cost\_intel] Parallel function calling creates exponential context growth as tool results stack multiplicatively
Disable parallel tool calling \(parallel\_tool\_calls: false\) when tool results are large; force sequential tool use or summarize tool results before returning them to the model to prevent context window exhaustion.
Journey Context:
Modern APIs like OpenAI's GPT-4 Turbo support parallel function calling, where the model can request multiple tool invocations at once. The trap is that each tool result is appended to the conversation history in full. If the model calls 5 tools and each returns a 500-token JSON object, the next turn's context includes all 2500 tokens of results PLUS the previous history. In multi-turn agent workflows, this compounds: turn 1 has 2500 result tokens, turn 2 might generate another 2500, leading to 5000\+ tokens of tool results alone. This rapidly exhausts the context window \(forcing expensive truncation or early termination\) and incurs massive per-token costs on every subsequent turn. The alternative—allowing this compounding—is silent bankruptcy. The fix is to disable parallel tool calling when dealing with large result payloads \(set parallel\_tool\_calls: false in OpenAI API\). This forces the model to request one tool at a time, allowing you to intercept and summarize large results before they hit the context window. For example, if a database query returns 1000 rows, don't dump the JSON into the context; instead summarize to 'Found 23 matching records' or compress the JSON. This keeps context growth linear rather than exponential.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:41:27.428384+00:00— report_created — created