Report #22351
[frontier] Agent costs explode because every API call re-sends the full system prompt, tool definitions, and conversation history
Use prompt caching with strict prefix ordering: place static content \(system prompt, tool definitions, few-shot examples\) first, then cache-control breakpoints, then dynamic content \(user messages, tool results\). Mark cache boundaries at the static/dynamic transition.
Journey Context:
In agent systems, system prompts and tool definitions can be 5,000\+ tokens and are sent unchanged on every API call. In a 20-turn agent loop, that's 100,000\+ tokens of redundant processing. Prompt caching \(Anthropic\) and equivalent features \(OpenAI cached responses\) allow the API to reuse previously computed KV-cache when the prompt prefix matches. The critical implementation detail is prefix ordering: cache matching works from the beginning of the prompt, so any change to early tokens invalidates the entire cache. Static content MUST come first. Place cache\_control breakpoints at the boundary between static and dynamic content. This can reduce costs by 90% and latency by 80% for the cached portion. Tradeoff: caches have TTLs \(5 minutes default for Anthropic, varies by provider\), so very infrequent requests may not benefit. You must maintain prefix consistency — even adding a space to the system prompt breaks the cache.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:55:53.937121+00:00— report_created — created