Report #85869
[cost\_intel] RAG pipeline costs explode with large static contexts on every query
Enable prompt caching \(Anthropic cache\_control or equivalent\) for system prompts and retrieved document sets; reduces cost by ~90% for contexts >10k tokens on repeated queries by paying cache write once then 10% read cost.
Journey Context:
Without caching, a 20k token RAG context costs $0.30\+ per query on Claude 3.5 Sonnet. Teams often send the same KB chunks repeatedly. Prompt caching allows amortizing the input cost: $0.025/1k tokens write once, then $0.0025/1k tokens read. Signature to watch: repeated long contexts with similar prefixes \(docs, codebases\). Common mistake is assuming caching is only for multi-turn chat; it's critical for stateless RAG APIs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:43:10.118017+00:00— report_created — created