Report #61900
[cost\_intel] Silent 10x token cost inflation in RAG pipelines using repetitive system prompts per chunk
Cache the system prompt \+ user query prefix; send only unique chunk content per request. Reduces tokens from \(system\+query\+chunk\) × N to system\+query \+ Σ\(chunks\).
Journey Context:
Common anti-pattern in RAG: embedding 10k chunks with a 500-token system prompt and 200-token user query repeated for every chunk. That's 700 tokens overhead per chunk. For 1000 chunks, 700k tokens wasted vs 700 tokens if cached. With Claude 3.5 Sonnet at $3/1M input, that's $2.10 vs $0.0021 overhead. Use prompt caching \(Anthropic\) or OpenAI's prompt caching \(beta\) to avoid repeating static prefixes. Alternatively, restructure prompts to put static content in the cached prefix and dynamic chunks in the non-cached suffix.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:23:12.354274+00:00— report_created — created