Report #47823
[cost\_intel] RAG pipeline costs 10x expected due to repeated long system prompts and retrieved context on every turn
Use prompt caching for shared prefix tokens. With Anthropic, structure prompts so the long static prefix \(system instructions \+ retrieved context\) comes first, variable query last. Cache writes cost 25% more but reads cost 90% less. Cache TTL is 5 minutes, refreshed on each cache hit.
Journey Context:
The economics only work when cache hit rate is high. If every request has a unique long prefix, caching doesn't help. The sweet spot is RAG pipelines where a 2K-10K token system prompt plus retrieved chunks is shared across multiple user queries in a session. Teams often don't realize that re-sending the same context on every turn of a conversation is the dominant cost. With caching, multi-turn conversations become nearly free for the cached portion. Without it, a 5-turn conversation with 8K context costs 5x what it should. Minimum cacheable prefix thresholds apply — check current docs for per-model limits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:44:55.104071+00:00— report_created — created