Report #24379
[frontier] Long-running agent conversations hit token limits or cost $50+ per session due to full context windows
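Implement Anthropic's prompt caching by setting `cache_control: {type: "ephemeral"}` on system prompts and stable multi-turn context blocks. Cache hits are billed at roughly 10% of the base input-token price, i.e. up to a 90% cost reduction on the cached prefix, and latency drops because static prefixes are not re-processed. Caching does not extend the context window: cached tokens still count toward the model's limit, so it makes a full 200k-token context cheaper to reuse, not larger.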
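To make the savings claim concrete, here is a rough back-of-the-envelope sketch. The function name `session_input_cost` and the example figures are hypothetical. The multipliers assume a cache write costs 1.25x and a cache read 0.10x the base input price, and that every call after the first is a cache hit within the cache lifetime; check current pricing before relying on these numbers.

```python
def session_input_cost(prefix_tokens, turns, base_price_per_mtok,
                       cached, write_mult=1.25, read_mult=0.10):
    """Estimate the input-token cost of re-sending a static prefix.

    Assumptions (not taken from the report): the first call writes the
    cache at `write_mult` times the base price, and each later call reads
    it at `read_mult` times the base price. Only the static prefix is
    counted; the growing message tail is ignored.
    """
    if turns <= 0:
        return 0.0
    per_call = prefix_tokens / 1_000_000 * base_price_per_mtok
    if not cached:
        # Every call pays full price for the whole prefix.
        return per_call * turns
    # One cache write, then (turns - 1) cache reads.
    return per_call * write_mult + per_call * read_mult * (turns - 1)


if __name__ == "__main__":
    # Illustrative figures only: 50k-token prefix, 40 turns, $3 per MTok.
    plain = session_input_cost(50_000, 40, 3.0, cached=False)
    cached = session_input_cost(50_000, 40, 3.0, cached=True)
    print(f"uncached: ${plain:.2f}, cached: ${cached:.4f}, "
          f"saving: {1 - cached / plain:.1%}")
```

With these illustrative inputs the prefix costs $6.00 uncached and about $0.77 cached, a saving of roughly 87%. The saving approaches 90% only as the number of cache hits grows, and a single-call session is slightly more expensive with caching because of the write surcharge.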
Journey Context:
Naive implementations send the entire conversation history (system prompt + tools + all messages) on every API call. As the conversation grows, the per-call cost grows linearly with history length (so the cumulative session cost grows roughly quadratically) and latency degrades. Anthropic's prompt caching (cache_control blocks) fits agent scenarios well, because system prompts and tool definitions stay static across turns. The alternative, 'summarize and truncate', loses information; caching preserves the full history at lower cost, as long as that history still fits in the context window. The key insight is to mark stable blocks (system, tools, static examples) with ephemeral cache breakpoints while leaving the dynamic message tail uncached. This is distinct from prompt compression techniques such as LLMLingua, which are lossy.
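The split between a cached stable prefix and an uncached message tail can be sketched as a request builder. This is a minimal illustration, not a definitive implementation: `build_cached_request` and the placeholder model ID are hypothetical, and the code only assembles the keyword arguments without calling the API.

```python
from typing import Any

# Hypothetical placeholder; substitute a real model ID before sending.
MODEL_ID = "MODEL_ID_PLACEHOLDER"


def build_cached_request(system_prompt: str,
                         tools: list[dict[str, Any]],
                         messages: list[dict[str, Any]],
                         model: str = MODEL_ID,
                         max_tokens: int = 1024) -> dict[str, Any]:
    """Assemble Messages API kwargs with cache breakpoints on stable blocks."""
    request: dict[str, Any] = {
        "model": model,
        "max_tokens": max_tokens,
        # Breakpoint on the system block: the prefix up to and including
        # this block (tools, then system) becomes cacheable.
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        # Dynamic tail: passed through without cache_control.
        "messages": list(messages),
    }
    if tools:
        # Shallow-copy each tool so the caller's definitions are not mutated.
        cached_tools = [dict(tool) for tool in tools]
        # Breakpoint on the last tool, so the tool definitions can be
        # cached as their own prefix.
        cached_tools[-1]["cache_control"] = {"type": "ephemeral"}
        request["tools"] = cached_tools
    return request
```

Assuming the official `anthropic` Python SDK, the result could be sent with `client.messages.create(**request)`, and `response.usage.cache_read_input_tokens` should be non-zero on a hit. Two caveats worth verifying against the current documentation: prefixes below a model-dependent minimum length are not cached, and the ephemeral cache expires after a short idle period, so widely spaced calls may miss.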
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:19:38.674717+00:00— report_created — created