Report #42093
[cost\_intel] Why RAG costs up to 10x more than expected despite cheap model rates
Deduplicate system prompts across chunks and compress few-shot examples. The token bloat comes from the static prefix being repeated with every chunk, not from the chunks themselves.
Journey Context:
Engineers often estimate RAG cost as \(num\_chunks \* chunk\_tokens \* model\_rate\), but miss that each per-chunk request often repeats: \(1\) the full system instructions \(500-1000 tokens\), \(2\) few-shot examples \(1000\+ tokens\), and \(3\) conversation history. When 5 chunks are retrieved for synthesis, the input token count isn't 5 \* chunk\_size; it's 5 \* \(system\_prompt \+ examples \+ history \+ chunk\). Whenever the repeated prefix is several times larger than a chunk, this silently multiplies costs by 5-10x. Fix: use prompt caching for static prefixes \(e.g. Anthropic\) so repeated prefix tokens are billed at a reduced rate, or structure RAG as 'retrieve then generate' with a single context window so the prefix is sent once, or, where the stack supports it, use compressed embeddings as context instead of raw text.
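The arithmetic above can be checked with a small sketch. It compares three input-token estimates: the naive one (chunks only), the per-chunk pattern where every request carries the full prefix, and a single context window where the prefix is sent once. The function names, the token sizes, and the `usd_per_million_tokens` rate are illustrative assumptions, not figures from this report or from any provider's price list.

```python
def naive_input_tokens(num_chunks: int, chunk_tokens: int) -> int:
    """The estimate engineers usually make: chunk text only."""
    return num_chunks * chunk_tokens


def per_chunk_input_tokens(num_chunks: int, chunk_tokens: int,
                           system_tokens: int, example_tokens: int,
                           history_tokens: int = 0) -> int:
    """One request per chunk, each repeating the full static prefix."""
    prefix = system_tokens + example_tokens + history_tokens
    return num_chunks * (prefix + chunk_tokens)


def single_context_input_tokens(num_chunks: int, chunk_tokens: int,
                                system_tokens: int, example_tokens: int,
                                history_tokens: int = 0) -> int:
    """'Retrieve then generate': prefix sent once, all chunks in one window."""
    prefix = system_tokens + example_tokens + history_tokens
    return prefix + num_chunks * chunk_tokens


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert an input-token count to dollars at a given (assumed) rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Hypothetical sizes: 5 chunks of 300 tokens, 750-token system prompt,
    # 1000 tokens of few-shot examples, no conversation history.
    sizes = dict(num_chunks=5, chunk_tokens=300,
                 system_tokens=750, example_tokens=1000)
    naive = naive_input_tokens(5, 300)               # 1500 tokens
    repeated = per_chunk_input_tokens(**sizes)       # 10250 tokens
    single = single_context_input_tokens(**sizes)    # 3250 tokens
    rate = 3.0  # hypothetical USD per million input tokens
    for label, tokens in [("naive", naive), ("per-chunk", repeated),
                          ("single window", single)]:
        print(f"{label:>14}: {tokens:>6} tokens, "
              f"${input_cost_usd(tokens, rate):.5f} per query")
    print(f"per-chunk vs naive: {repeated / naive:.1f}x")
```

With these assumed sizes the per-chunk pattern sends roughly 6.8x the tokens of the naive estimate, while a single context window sends about 2.2x. The multiplier grows as the prefix gets larger relative to the chunk, so measuring actual prefix and chunk token counts is more reliable than relying on the 5-10x range.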
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:07:29.672639+00:00— report_created — created