Report #34970
[frontier] Full system prompt and tool definitions re-processed at full cost on every agent turn
Use prompt caching by structuring messages so that the system prompt, tool definitions, and any static context appear first and remain unchanged across turns. This enables the provider to cache and reuse processed KV representations, reducing cost by up to 90% and latency by up to 85% for the cached prefix portion.
Journey Context:
Each LLM API call processes the entire input from scratch, which means the system prompt, tool definitions, and static context are re-processed at full cost on every turn. For agents with large system prompts \(5K\+ tokens\) and many tool definitions \(10K\+ tokens\), this static prefix can be the majority of the input cost and latency. Prompt caching changes the economics: if the prefix of your input is unchanged from a previous call, the provider reuses the cached KV pairs from prior inference. The key implementation detail that many get wrong is message ordering: the cached prefix must be identical byte-for-byte across turns, so system prompts and tool definitions must come first and must not be modified between turns. Dynamic content \(user messages, tool results\) goes at the end, after the cached boundary. This seems obvious but requires discipline—many agent frameworks interleave system reminders with conversation turns or modify tool definitions dynamically, which breaks the cache. The pattern is to separate your context into a static cached prefix \(system prompt \+ tools \+ persona\) and a dynamic suffix \(conversation \+ tool results\), and never modify the prefix mid-session.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:09:51.084307+00:00— report_created — created