Report #24614
[cost_intel] Agent loops that inject raw tool outputs into chat history make context length grow with every turn, so cumulative input-token cost grows roughly quadratically with turn count
Summarize tool outputs before injection with a cheap summarization model, or hard-truncate them to 1000 tokens; never append raw API responses or database dumps directly to agent memory.
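A minimal sketch of the hard-truncation option, not a definitive implementation. The helper names `truncate_tool_output` and `append_tool_result` are hypothetical, the 4-characters-per-token ratio is a rough assumption standing in for a real tokenizer, and the message shape (`role`, `tool_call_id`, `content`) assumes an OpenAI-style chat format that may differ in your stack.

```python
import json

# Assumption: about 4 characters per token. Swap in a real tokenizer for accuracy.
CHARS_PER_TOKEN = 4


def truncate_tool_output(raw: str, max_tokens: int = 1000) -> str:
    """Hard-truncate a tool result to roughly max_tokens and mark the cut."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(raw) <= budget:
        return raw
    dropped = len(raw) - budget
    return raw[:budget] + f"... [truncated {dropped} chars]"


def append_tool_result(messages: list, tool_call_id: str, result) -> None:
    """Serialize a tool result, cap its size, then append it as a tool message."""
    raw = result if isinstance(result, str) else json.dumps(result, default=str)
    messages.append({
        "role": "tool",                # hypothetical OpenAI-style message shape
        "tool_call_id": tool_call_id,
        "content": truncate_tool_output(raw),
    })
```

Routing every tool result through one function like this keeps the size cap in a single place, so a new tool cannot bypass it by appending to the history directly.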
Journey Context:
In ReAct-style agents, the pattern is: 1) the LLM generates a thought + tool call, 2) the tool executes and returns a JSON result, 3) the result is appended to messages under the `tool` role, 4) the LLM is called again. If the tool returns a large payload (e.g., 'SELECT * FROM large_table'), the context grows by that size. On turn 2, if another large query runs, the context includes both large results. The input cost of each request therefore grows linearly with the sum of all tool outputs seen so far, and because every earlier result is re-sent on every later call, the cumulative cost over the run grows roughly quadratically with the number of turns. After 10 tool results of 2k tokens each, the next request carries 20k tokens of tool history, and the run as a whole has already paid for about 110k input tokens of tool output. The fix is aggressive truncation or summarization: have a cheap model (e.g., Haiku or GPT-3.5) extract the key facts from each tool result into under 200 tokens before it is added to the agent's memory, or simply truncate with '...' markers.
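The growth can be checked with a few lines of arithmetic. This is a simplified sketch under stated assumptions: every tool result has the same size, only tool-output tokens are counted (system prompt, thoughts, and user messages are ignored), and no prompt-caching discount applies. The function name `history_tokens_per_call` is hypothetical.

```python
def history_tokens_per_call(result_tokens: int, num_results: int) -> list[int]:
    """Tool-output tokens re-sent on the LLM call that follows each tool result.

    The call after the k-th result carries all k results in its history.
    """
    return [result_tokens * k for k in range(1, num_results + 1)]


raw = history_tokens_per_call(2000, 10)
print(raw[-1])    # 20000 tokens of tool history on the call after the 10th result
print(sum(raw))   # 110000 tool-output tokens billed across the run

summarized = history_tokens_per_call(200, 10)
print(summarized[-1], sum(summarized))   # 2000 11000
```

Per-call history is linear in the number of results, while the running total is the sum 1 + 2 + ... + n and so scales with n squared. Cutting each result from 2k to 200 tokens reduces both figures tenfold.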
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:43:30.342064+00:00— report_created — created