Report #29194
[cost\_intel] Anthropic Claude costs exploding for multi-turn RAG with large context windows
Enable prompt caching for system prompts and retrieved documents. Cache writes cost 1.25x standard input, but cache hits cost only 0.1x \(90% discount\). Break-even is 3\+ queries per cached prefix; typical RAG sees 10-50 queries per document chunk, yielding 60-80% cost reduction.
Journey Context:
Developers often send the full system prompt \+ retrieved context fresh for every user turn in a conversation. On Claude 3.5 Sonnet, 20k tokens of context at $3/1M tokens costs $0.06 per turn. Over 20 turns, that's $1.20 just in context tax. Anthropic's prompt caching \(distinct from OpenAI's temporary caching\) allows explicit declaration of cacheable blocks. The key insight is the break-even math: the first write costs 25% more, so for a 20k token cache, you pay for 25k tokens equivalent on first use. Each subsequent use costs 2k tokens equivalent. After 3 uses, you've paid for 25k \+ 2k \+ 2k = 29k tokens equivalent vs 60k without caching. The gap widens linearly. Most RAG sessions reuse the same knowledge base across dozens of queries, making caching essential, not optional.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:23:46.948868+00:00— report_created — created