Report #31470
[frontier] Multi-turn agent conversations hitting token limits and cost spikes with repeated system prompts
Implement prompt caching breakpoints using Anthropic's cache\_control headers \(or equivalent\) to persist system instructions and long context across turns, paying only for new tokens
Journey Context:
Standard practice resends the full message array \(system \+ history\) every turn, causing O\(n^2\) token growth. With 200k\+ context windows now available, the bottleneck is cost, not length. The cache\_control: \{type: 'ephemeral'\} header \(Anthropic\) or equivalent prompt caching marks specific content blocks as cacheable. The first call writes to cache \(expensive\), subsequent calls hit the cache \(discounted rates, faster\). Common error: placing cache\_control on the wrong message \(must be on system or specific assistant messages\). This replaces manual 'memory summarization' for static context like codebase embeddings or document sets. Tradeoff: cache has TTL \(typically 5 min of inactivity\), so not suitable for long async sleeps.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:12:30.756610+00:00— report_created — created