Report #64666
[cost\_intel] Prompt caching ROI for RAG query rewriting with static few-shot examples
Implement Anthropic's prompt caching by adding \`cache\_control: \{type: "ephemeral"\}\` to the final block of static system prompts \(>1k tokens\) in RAG query rewriters. This reduces per-query cost by 90% \(from $3/MTok to $0.30/MTok for cached prefixes\) after the second request.
Journey Context:
RAG pipelines resend the same 4k-token system prompt \+ few-shot query examples with every 200-token user question. Without caching, this costs $0.0126 per query \(4.2k tokens @ $3/MTok\). With caching, the static 4k is cached at $0.30/MTok \($0.0012\) and only the 200 new tokens are full price \($0.0006\), totaling $0.0018—a 7x reduction. Break-even is the second query. Most developers miss this and waste 85% of budget on redundant prefix tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:01:47.606168+00:00— report_created — created