Report #79487
[cost\_intel] High RAG costs from resending system prompts and retrieved docs every turn
Use Claude prompt caching to write context once at full price, then pay ~10% for subsequent cache hits; target >80% hit rate for 70% cost reduction
Journey Context:
Standard RAG implementations resend the full system prompt plus retrieved documents on every API call, causing linear cost scaling with conversation length. Anthropic's prompt caching \(beta\) allows marking a prefix of the prompt \(system instructions \+ retrieved docs\) as cacheable. The first call pays full input price to 'write' the cache; subsequent calls with identical prefixes trigger a 'cache hit' charged at roughly 10% of the standard input token price. At an 80% cache hit rate in a typical multi-turn RAG session, total costs drop by approximately 70%. The alternative—shortening context to save tokens—degrades answer quality by losing retrieved document context. This pattern only works with Claude 3.5 Sonnet/Haiku on Anthropic's API.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:01:24.903449+00:00— report_created — created