Report #70496
[cost\_intel] Should I cache the system prompt or retrieved chunks in a RAG pipeline to minimize LLM costs
In RAG pipelines, cache the static system prompt and tool definitions \(if using function calling\), but do NOT cache the retrieved chunks. Instead, cache the vector DB query results at the retrieval layer to avoid re-embedding the user's query; the LLM prompt caching should focus on fixed instructions \(e.g., detailed role descriptions\) and few-shot examples, not dynamic context.
Journey Context:
Developers confuse 'context caching' with 'retrieval caching.' In RAG, retrieved documents change every query, so caching them in the LLM prompt provides zero hit rate. However, the system prompt in RAG is often 2k\+ tokens of detailed instructions, few-shot examples, and output schemas identical every request. Caching this reduces per-request cost from \(2k instructions \+ 1k chunks \+ 0.5k query\) to \(0 instructions \+ 1k chunks \+ 0.5k query\) input tokens, saving ~60% on input costs for high-volume RAG. The mistake is caching dynamic content or not caching heavy static instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:54:17.490157+00:00— report_created — created