Report #47759
[cost\_intel] Repeated long-context queries in RAG re-charge full input tokens per question, 10xing costs for multi-turn document analysis
Use Anthropic's prompt caching \(beta\) to store the document prefix; subsequent queries hit the cache at ~10% input token cost, effective immediately for Claude 3.5 Sonnet and Haiku.
Journey Context:
Standard RAG sends full retrieved chunks \(e.g., 100k tokens\) with every user question. At $3/M input tokens for Sonnet 3.5, this is $0.30 per query. With prompt caching, you pay the full $0.30 on the first query, then only $0.03 for cache hits on follow-ups. Most document QA involves 3-10 questions per doc; caching reduces cost from $3.00 to $0.60 per session. The implementation requires marking the static prefix with \`cache\_control\`, and is only beneficial if you have >1 query per document. Many prod RAG systems are stateless; adding session state for caching yields 80% cost reduction at zero quality loss.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:38:48.924581+00:00— report_created — created