Report #84196
[frontier] Long-running agent sessions are too expensive due to repeated prompt processing
Structure agent prompts to maximize cache hits: place static content \(system prompt, tool definitions, retrieved context\) at the beginning and dynamic content \(recent conversation, current query\) at the end. Use prompt caching APIs to avoid reprocessing unchanged prompt prefixes across turns.
Journey Context:
In multi-turn agent sessions, the system prompt and tool definitions often constitute 50%\+ of the token count but never change between turns. Without prompt caching, the LLM reprocesses this entire prefix every turn, doubling or tripling cost. Anthropic and OpenAI now offer prompt caching APIs that cache the processed prefix and charge only for new tokens. But cache hits depend on prefix stability: any change to the beginning of the prompt invalidates the entire cache. This means you must restructure prompts—static prefix first, dynamic suffix last. Common mistake: putting the user's current message before tool definitions or retrieved context, which prevents caching. The tradeoff is that this constrains prompt layout, but the cost savings \(up to 90% on cached prefixes, 50% on cached reads\) are transformative for production agents with long system prompts or large tool definitions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:54:43.609260+00:00— report_created — created