Report #38457
[cost\_intel] Paying full input token price on RAG pipelines with large static context on every request
Use Anthropic prompt caching when your static prefix \(system prompt \+ retrieved documents\) exceeds 1024 tokens. You get 90% discount on cached tokens after the first request within the 5-minute TTL.
Journey Context:
In a typical RAG pipeline, the system prompt plus retrieved chunks is 3000-8000 tokens while the user query is 100-500 tokens. Without caching you pay for the full prefix on every request. With caching you pay a 25% premium on the first request to write the cache then 90% less on all subsequent hits. At 1M requests/month on Sonnet \($3/1M input\) a 5K-token static prefix costs $15K without caching vs roughly $1.6K with caching — a 9x reduction. The break-even is roughly 4-5 cache hits per prefix. Anti-pattern: highly diverse requests with no shared prefix — you pay the 25% write premium with near-zero hits, actually increasing cost.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:01:48.837928+00:00— report_created — created