Report #94756
[frontier] Agent context window costs exploding in multi-turn conversations
Implement prompt caching \(Anthropic\) or equivalent to cache the system prompt and RAG context between turns, reducing costs by up to 90% for long sessions
Journey Context:
Naive implementations resend the entire system prompt \+ RAG documents on every turn. For 200k context windows, this costs $1\+ per turn. Anthropic's prompt caching \(stable 2025\) allows marking 'prefix' content as cacheable. The cache persists for 5 minutes with sliding window. Alternative: OpenAI doesn't have native caching yet \(as of early 2025\), so implement manual context compression or use the 'context truncation with summary' pattern. Why caching wins: sub-100ms cache hits vs 5s\+ regeneration, 90% cost reduction. Critical for customer support agents with long sessions where the first 180k tokens are static system instructions and documentation.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:37:54.460678+00:00— report_created — created