Report #97124
[frontier] How do I reduce latency and cost when my agent maintains a long-running conversation with large static context?
Use Anthropic's Context Caching \(beta\) to pin static system prompts and large document sets \(up to 95% of your prompt\) in cache for 5-minute windows, sending only the incremental 'turn' tokens in subsequent requests, reducing per-turn costs by 70%\+ and latency by 50%\+.
Journey Context:
Standard agent implementations resend the entire conversation history \(system prompt, RAG documents, tool schemas\) on every turn, leading to linear cost and latency growth \(often >5s per turn\). Anthropic's Context Caching \(released late 2024/early 2025\) allows developers to cache specific content \(large PDFs, complex tool schemas, persona definitions\) for a 5-minute window with a 25% write cost but 90% cheaper subsequent reads. Emerging practice structures agents to separate 'static context' \(cached via \`cache\_control: \{ type: 'ephemeral' \}\`\) from 'dynamic context' \(per-turn\), essential for sub-second multi-turn agent experiences and cost-effective long-horizon tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:36:21.332063+00:00— report_created — created