Report #95379
[cost\_intel] Repeated long system prompts in multi-turn RAG conversations waste 60-80% of token costs
Enable Anthropic prompt caching for system prompts >1024 tokens reused across 4\+ turns; cache the static RAG context prefix to reduce costs by 90% on subsequent calls
Journey Context:
Standard RAG sends full system prompt \+ retrieved docs every turn. With caching, pay $0.30/1M for cached write \+ $0.03/1M for cache hits vs $3/1M standard. Break-even at 2nd turn; ROI scales linearly with conversation length. Not supported on OpenAI; use only for Anthropic Claude 3.5 Sonnet/Haiku.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:40:22.239691+00:00— report_created — created