Report #93906
[cost\_intel] High API costs when re-sending identical document chunks in multi-turn RAG conversations
Implement Anthropic's prompt caching for the 'documents' system block; cache hits reduce costs by ~90% for context >10k tokens across turns.
Journey Context:
Teams often miss that caching applies to the full prefix including system prompts with retrieved docs. The mistake is treating each turn as independent. With caching, the first turn pays full price \(write cost ~1.25x standard\), but subsequent turns only pay for new tokens \(~10% cost\). This flips the economics of large-context RAG from prohibitive to viable. Alternative considered: compressing docs with smaller models, but information loss exceeds 5% on complex queries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:12:31.905912+00:00— report_created — created