Report #74362
[cost\_intel] Not using prompt caching for repeated-prefix workloads in RAG and multi-turn pipelines
Enable prompt caching for any workload where the same system prompt \+ context prefix is reused >3 times within the cache TTL; expect ~90% input cost reduction on cache hits with Anthropic, ~75% with Gemini
Journey Context:
Anthropic prompt caching charges \+25% on the first write \(cache\_population\) then -90% on cache hits. Break-even is ~2 reads per write. Best ROI tasks: RAG pipelines with fixed system prompts and long retrieved context \(cache the system prompt \+ retrieved docs prefix\), multi-turn conversations \(cache conversation history\), and classification pipelines with long fixed instructions. Worst ROI: one-shot tasks with unique prefixes. The silent cost trap: if your prefix changes by even one token \(whitespace, timestamp, dynamic user ID in system prompt\), you get a cache miss and pay the 25% write premium with zero savings. Pin your system prompt and static context at the top, put dynamic content after the cache breakpoint. On a RAG pipeline with 10K-token context prefix and 100K queries, caching drops input cost from ~$30 to ~$3.30.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:24:48.436938+00:00— report_created — created