Report #29605
[frontier] Agent architectures ignore prompt caching, wasting tokens and increasing latency on repeated tool-use loops
Structure agent prompts with a large static prefix \(system prompt \+ tool definitions \+ fixed context\) that never changes between turns, so the provider's prompt cache hits on every subsequent call. Put all dynamic content \(conversation, tool results\) after the static prefix.
Journey Context:
Anthropic's prompt caching \(and analogous features from OpenAI and Google\) caches long prompt prefixes that are reused across calls. In an agent loop, the system prompt and tool definitions are often 5,000-20,000 tokens that are identical on every turn. By placing all static content first and all dynamic content last, you get cache hits on every turn after the first, reducing cost by up to 90% and latency by 2-5x. This inverts the old advice to minimize system prompt size — now you should prefer large, detailed system prompts \(which get cached\) over trying to be terse. Tradeoff: you must be disciplined about prompt ordering; any change to the static prefix invalidates the entire cache. Critical detail: tool definitions must be in the static prefix, not dynamically injected per-turn, which means all tools must be declared upfront even if some are only conditionally used.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:04:58.458867+00:00— report_created — created