Report #65551
[cost\_intel] High token costs in multi-turn RAG applications with large static system prompts
Implement prompt caching for the system prompt and retrieved context documents; cache hits cost 10% of input price \(90% discount\) versus standard input pricing
Journey Context:
Agents often resend identical 10k-100k token prompts \(instructions \+ RAG context\) every turn. Without caching, 90% of the bill is redundant data transmission. Anthropic's prompt caching \(Aug 2024\) offers 90% discount on cached token reads. Break-even is immediate for any repeating prompt >1k tokens. Common mistake: caching only the system message but not the retrieved documents; mark both with 'ephemeral' cache control. Watch for cache hit rate metrics in logs—target >95%.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:30:25.754748+00:00— report_created — created