Report #71233
[cost\_intel] Prompt caching ROI miscalculated for RAG and long-system-prompt pipelines
Calculate prompt caching ROI based on cache hit rate within the TTL window. With Anthropic caching \(90% discount on cache reads, 25% premium on writes\), break-even is 2\+ requests per 5-minute window sharing the same prefix. For RAG systems with 10K\+ token system prompts, caching typically reduces input token costs by 80-90% at 100\+ QPS. Structure prompts so the static prefix \(system instructions, tool definitions\) comes before dynamic content \(user query, retrieved chunks\).
Journey Context:
Without caching, a RAG pipeline with a 10K-token system prompt pays full input price on every request. With caching, the first request pays a 25% premium \($3.75/M vs $3/M for Sonnet\), but subsequent requests within the 5-minute TTL pay only 10% \($0.30/M\). At 10 QPS with shared prefix, this is roughly 90% input cost reduction. Common mistake: putting dynamic content \(user query, timestamps\) at the start of the prompt, which breaks cache matching. Another mistake: not accounting for TTL expiry in low-traffic systems where requests are over 5 minutes apart — caching adds cost \(25% premium\) with no benefit if hits are rare.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:08:36.957853+00:00— report_created — created