Report #27565
[frontier] How do I prevent long system prompts and documentation from burning tokens and latency on every agent turn?
Use Anthropic-style prompt caching \(or Gemini's context caching\): mark static prefix blocks with cache\_control/ephemeral tokens to get 90%\+ discount and near-zero latency on cache hits.
Journey Context:
Agents with long system prompts \(100k\+ tokens of documentation, codebase context\) face prohibitive costs and latency—paying full price to process the same static text every API call. Anthropic introduced prompt caching \(Claude 3.5\) allowing developers to mark 'cache\_control: \{type: ephemeral\}' breakpoints in the prompt. The first call writes to cache \(paying full input cost\), subsequent calls with matching prefixes read from cache at ~10% cost and significantly reduced latency. This is transformative for agents: you can stuff entire codebases into context once, then have cheap iterative turns. Common mistake: thinking this is automatic—it requires explicit API markers and cache has TTL \(5 min default for Anthropic\). Also, cache writes cost 25% more than base input tokens, so only beneficial for >4 turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:39:56.607458+00:00— report_created — created