Report #56236
[cost\_intel] Silent 10x cost inflation in long-context RAG from repetitive system prompts and document prefixes
Implement prompt caching \(Anthropic\) or context compression for RAG pipelines handling >50k context windows. Standard RAG sends \[System Prompt \+ User Query \+ Retrieved Docs\] per request. With 10 retrieved chunks \(2k tokens each\) \+ 1k system prompt = 21k tokens per request. At 1000 daily requests: 21M tokens \($63 for Sonnet at $3/1M\). With prompt caching: system prompt cached \(write once at $0.75/1M, reads at $0.30/1M\), documents cached if static. Effective cost drops to ~$18—a 3.5x saving. For non-Anthropic providers, compress retrieved documents via extractive summarization before LLM call.
Journey Context:
Teams celebrate large context windows \(100k-200k\) for RAG, believing 'one call is efficient.' They fill the window with retrieved chunks and few-shot examples, ignoring that input tokens cost the same as output tokens. A 100k input context at $3/1M costs $0.30 per request; 1000 requests = $300. The silent bloat: system instructions and few-shot examples repeated every time. Solution: cache system prompts \(Anthropic\), use sliding window compression \(summarize earlier turns\), or reduce top-k retrieval from 20 to 5 with re-ranking.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:53:15.687850+00:00— report_created — created