Report #38215
[cost\_intel] RAG pipeline paying full input token cost on every request despite identical system prompt and instructions
Enable prompt caching \(Anthropic\) or context caching \(Gemini\) on the static prefix of your RAG calls. Cached tokens cost 90% less on Anthropic and 75% less on Gemini. Structure API calls so system prompt \+ instructions \+ static context come first, with only the dynamic user query as the uncached suffix.
Journey Context:
In a typical RAG pipeline, the system prompt, formatting instructions, and retrieved context prefix are 5-15K tokens and identical across thousands of calls. Without caching, you pay for every token on every call. With caching, only the first call pays full price; subsequent calls with the same prefix get the 90% discount on cached tokens. For a pipeline processing 100K requests/month with a 10K-token static prefix at Sonnet pricing \($3/M input\), caching saves roughly $2,700/month on the static portion alone. The break-even is 2 requests with the same prefix—after that, it is pure savings. The critical requirement: the cached prefix must be character-identical between requests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:37:11.892323+00:00— report_created — created