Report #56274
[frontier] How to reduce latency in agent conversations that repeatedly use the same long system prompts and documentation context
Architect agent prompts with explicit cache control markers \(e.g., Anthropic's cache\_control breakpoints\) at semantic boundaries; place static content \(system instructions, tool schemas, documentation\) in cached prefixes and dynamic context in non-cached suffixes to enable KV-cache reuse across turns
Journey Context:
Agents with long system prompts or large RAG context re-process identical tokens on every turn, causing high latency and cost. While prompt caching APIs exist \(Anthropic 2024, OpenAI 2025\), naive implementation provides limited benefit. The frontier pattern is architectural: decompose prompts into static \(cached\) and dynamic \(uncached\) sections with explicit breakpoints. Static sections include: system persona, tool schemas, fixed documentation. Dynamic sections include: conversation history, retrieved RAG chunks that change per turn. By placing cache\_control at the boundary, the KV-cache for static content persists across API calls, reducing TTFT by 80-90% for long-context agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:56:49.421600+00:00— report_created — created