Report #62318
[cost_intel] Using naive ReAct tool loops without context compression in 10k+ token contexts
Implement prompt caching or context-window compression for tool-use loops. With a 10k-token context and 10 tool calls, a naive implementation pays for 100k repeated tokens (10 × 10k), versus roughly 10k + 9 × increment with caching, where increment is the number of new tokens added per call.
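One way to apply the caching half of this recommendation is to mark the static prefix (tool definitions and system prompt) as cacheable when building each request. The sketch below assumes Anthropic's Messages API and its `cache_control` field; the helper name `build_cached_request` and the `MODEL_NAME` placeholder are hypothetical. It only builds the request arguments and makes no network call.

```python
from typing import Any


def build_cached_request(
    system_prompt: str,
    tools: list[dict[str, Any]],
    messages: list[dict[str, Any]],
    model: str = "MODEL_NAME",  # placeholder, substitute a real model id
    max_tokens: int = 1024,
) -> dict[str, Any]:
    """Build Messages API keyword arguments with cache breakpoints on the static prefix.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Copy each tool dict so the caller's definitions are not mutated.
    cached_tools = [dict(tool) for tool in tools]
    if cached_tools:
        # A breakpoint on the last tool marks the whole tool list as cacheable.
        cached_tools[-1]["cache_control"] = {"type": "ephemeral"}

    return {
        "model": model,
        "max_tokens": max_tokens,
        # The system prompt is sent as a content block so it can carry a breakpoint.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": cached_tools,
        # The growing conversation history is sent as-is on every turn.
        "messages": messages,
    }
```

With the official Python SDK, the result would typically be passed as `client.messages.create(**build_cached_request(...))`. The static prefix has to stay byte-identical across turns for the cache to hit, so tool definitions and the system prompt should not be regenerated or reordered inside the loop.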
Journey Context:
Common mistake: treating tool-use agents like stateless APIs. In ReAct patterns, each LLM call includes the system prompt (5k), the tool definitions (2k), and a growing conversation history. After 10 tool calls with a 1k average response, the context balloons to 20k+. Without prompt caching (Anthropic) or stateful APIs (OpenAI doesn't offer this natively), you pay full price for the same 20k tokens 10 times: 200k tokens. With prompt caching, the first call costs 25k token-equivalents (a cache write, which implies a 1.25x rate on 20k), and each subsequent call costs 2k new tokens plus a 20k cache read at 0.1x cost (2k), for a total of 25k + 9 × (2k + 2k) = 61k equivalent tokens. Savings: about 70% (1 - 61k/200k = 69.5%). Quality impact: none with caching; with naive compression (summarization), there is a risk of losing tool result details. Detection signature: if costs scale linearly with turn count at the full context size, rather than with the new tokens alone, you're not caching.
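The arithmetic above can be checked with a small cost model. This is a sketch under the report's own simplification (the context is held at a fixed 20k tokens for all 10 calls). The 0.1x read multiplier comes from the report, the 1.25x write multiplier is inferred from its 25k first-call figure, and the function names are illustrative.

```python
# Multipliers relative to the base input-token price.
CACHE_WRITE_MULT = 1.25  # assumption: inferred from 25k billed for a 20k first call
CACHE_READ_MULT = 0.10   # from the report: cache reads at 0.1x cost


def naive_cost(context_tokens: int, calls: int) -> float:
    """Every call re-sends the whole context at full price."""
    return float(context_tokens * calls)


def cached_cost(context_tokens: int, calls: int, new_tokens_per_call: int) -> float:
    """First call writes the cache; later calls pay for new tokens plus a discounted read."""
    first_call = context_tokens * CACHE_WRITE_MULT
    later_call = new_tokens_per_call + context_tokens * CACHE_READ_MULT
    return first_call + (calls - 1) * later_call


naive = naive_cost(20_000, 10)
cached = cached_cost(20_000, 10, 2_000)
savings = 1 - cached / naive
print(f"naive={naive:,.0f} cached={cached:,.0f} savings={savings:.1%}")
```

In a real loop the context grows on every turn, so the naive total grows faster than this fixed-size model suggests and the gap in favor of caching widens.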
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:05:16.577861+00:00— report_created — created