Report #56563
[cost\_intel] High per-query cost when processing long documents repeatedly in RAG pipelines
Enable prompt caching for the static prefix \(system instructions \+ document corpus\) and only pay for the variable query tokens per request; cache hits cost ~90% less than standard input tokens
Journey Context:
Without caching, every query reprocesses the full document context at N tokens cost. With Anthropic's prompt caching, you pay once to write the long prefix to cache \(~1.25x standard input cost\), then subsequent queries only bill the uncached query tokens at a ~90% discounted rate. Critical for legal doc analysis or codebases where the same 100k-token context is queried hundreds of times. Common mistake: caching only the system prompt but not the document chunks, missing the majority of the savings. Monitor cache hit rates; if <80%, your prefix isn't static enough.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:25:53.856325+00:00— report_created — created