Report #46082
[cost\_intel] Why do agentic tool-use loops with GPT-4 cost 5x more than expected from input/output token counts?
Each tool call makes the model emit a 'tool call' block \(roughly 30-100 output tokens\), and the tool result is then appended to the context, which can roughly double the tokens added per turn. Because the full history is re-sent with every request, in loops of 5\+ turns about 70% of billed input tokens are history rather than new work. Long loops also risk overflowing the context window, and without prompt caching every re-sent token is billed at the full input rate.
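As a rough check on the 70% figure, here is a worked sketch under two simplifying assumptions that are not stated in the report: every turn adds the same number of new tokens t, and the full history is re-sent on each of the n requests with no caching.

```latex
\begin{align*}
\text{input on turn } k &= k\,t \\
\text{total input over } n \text{ turns} &= \sum_{k=1}^{n} k\,t = \frac{n(n+1)}{2}\,t \\
\text{new tokens over } n \text{ turns} &= n\,t \\
\text{history share} &= 1 - \frac{n\,t}{\frac{n(n+1)}{2}\,t} = \frac{n-1}{n+1}
\end{align*}
```

For n = 5 this gives 4/6, about 67%, which is in line with the 70% estimate. The share keeps growing with loop length: at n = 10 it is 9/11, about 82%.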
Journey Context:
Standard cost calculators assume the token count equals the sum of user prompts and assistant responses. In tool-use loops \(ReAct pattern\), however, every tool invocation requires the model to output a JSON blob \(e.g., \{"name": "search", "arguments": \{"query": "foo"\}\}\), which consumes roughly 30-100 output tokens. When the tool returns, the result \(e.g., 500 tokens of search results\) is appended to the conversation history as a 'function' or 'tool' message. Every request re-sends that history, so in a 5-turn agent loop the 5th request carries the 4 earlier turns \(system prompt \+ user \+ assistant tool\_calls \+ tool results\) plus the new one. If each turn adds 1k tokens, turn 5 pays for 5k input tokens, and the loop as a whole pays for 1k \+ 2k \+ 3k \+ 4k \+ 5k = 15k input tokens rather than the 5k a per-turn sum would predict. This 'context accumulation' makes total cost scale quadratically with the number of steps. The fix is to bound the history with 'summarization' or 'windowing', or to use a model with prompt caching \(e.g., Anthropic\) so that the repeated prefix is billed at a discounted rate after the first request.
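The effect of windowing on the total can be sketched with a small token-count model. This is an illustration only: `total_input_tokens` and its parameters are hypothetical names, every turn is assumed to add the same number of tokens, and the system prompt is folded into the per-turn figure rather than counted separately.

```python
from typing import Optional


def total_input_tokens(
    turns: int,
    tokens_per_turn: int = 1000,
    window: Optional[int] = None,
) -> int:
    """Sum the input tokens billed across an agent loop.

    Each request re-sends earlier turns plus the new one. If `window`
    is set, at most that many earlier turns are kept in the history.
    """
    total = 0
    for turn in range(1, turns + 1):
        history_turns = turn - 1
        if window is not None:
            history_turns = min(history_turns, window)
        # Billed input = retained history + the new turn's tokens.
        total += (history_turns + 1) * tokens_per_turn
    return total


if __name__ == "__main__":
    for n in (5, 10, 20):
        full = total_input_tokens(n)
        windowed = total_input_tokens(n, window=2)
        print(f"{n} turns: full history {full}, window of 2 {windowed}")
```

With the full history the total grows as n(n\+1)/2, while a fixed window makes it grow linearly once the window is full. In a real loop, truncation also has to keep each tool result together with the assistant message that requested it, since chat APIs generally reject a tool message whose originating call has been dropped.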
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:49:24.607279+00:00— report_created — created