Report #65999
[cost\_intel] Where do RAG systems silently inflate token costs by 3-10x beyond the raw document content?
Token bloat concentrates in three areas: \(1\) re-ranking steps that pass full documents to the LLM instead of snippets, \(2\) chunking strategies that use around 40% overlap to preserve context, so the same text is embedded and retrieved more than once, and \(3\) system prompts that resend static instructions on every turn. To optimize: re-rank with embeddings \(no LLM call\), reduce overlap to about 10% for dense documents, and enable prompt caching for static prefixes. Together these changes can reduce costs by 60-80% in high-volume RAG.
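As an illustration of re-ranking without an LLM call, here is a minimal sketch that orders candidate chunks by cosine similarity between a query embedding and precomputed chunk embeddings. The function names, chunk ids, and 3-dimensional toy vectors are hypothetical; in a real pipeline the vectors would come from whatever embedding model the system already uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_by_embedding(
    query_vec: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
    top_n: int,
) -> List[str]:
    """Return the ids of the top_n candidates most similar to the query.

    No LLM is called, so re-ranking adds no prompt tokens.
    """
    scored = [(cosine_similarity(query_vec, vec), cid) for cid, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:top_n]]


# Toy vectors (assumption): stand-ins for real embedding model output.
query = [1.0, 0.0, 0.0]
chunks = [
    ("chunk_a", [0.0, 1.0, 0.0]),
    ("chunk_b", [0.9, 0.1, 0.0]),
    ("chunk_c", [0.5, 0.5, 0.0]),
]
top = rerank_by_embedding(query, chunks, top_n=2)
print(top)
```

Only the chunks that survive this step need to be placed in the LLM prompt, so the re-ranking stage itself costs no input tokens.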
Journey Context:
Engineers typically estimate RAG costs as \(num\_docs \* avg\_tokens\). In practice, naive implementations often send 3-5x that many tokens, for three reasons: the top-5 chunks are sent at 1k tokens each \(5k total\) even when the answer sits in a single sentence; re-ranking is done by asking the LLM to 'evaluate relevance' of each document, which produces a massive prompt; and chunks overlap to prevent semantic breaks, so the same text appears more than once. A 10-page document \(5k tokens\) can easily become 15k tokens in the LLM context. The fix isn't 'use smaller chunks' \(which hurts recall\) but 'use embedding re-ranking' and 'cache system prompts.' This insight comes from production RAG optimization at scale.
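The arithmetic above can be sketched as a small token-accounting script. Every number below is an illustrative assumption, not a measurement: a 500-token system prompt, 10 re-rank candidates, a 100-token re-rank instruction, and the 5k-token document and 1k-token chunks from the text. All function names are hypothetical.

```python
def naive_estimate(num_docs: int, avg_tokens: int) -> int:
    """The back-of-envelope estimate from the text: num_docs * avg_tokens."""
    return num_docs * avg_tokens


def tokens_per_query(
    top_k: int,
    chunk_tokens: int,
    system_prompt_tokens: int,
    rerank_candidates: int,
    rerank_instruction_tokens: int,
) -> int:
    """Input tokens sent to the LLM for one query.

    Counts the generation prompt plus one LLM relevance call per re-rank
    candidate. Set rerank_candidates=0 to model embedding-based re-ranking.
    """
    generation = system_prompt_tokens + top_k * chunk_tokens
    reranking = rerank_candidates * (rerank_instruction_tokens + chunk_tokens)
    return generation + reranking


def indexed_tokens(doc_tokens: int, chunk_tokens: int, overlap_pct: int) -> int:
    """Total tokens across all sliding-window chunks of one document."""
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be in [0, 100)")
    stride = chunk_tokens * (100 - overlap_pct) // 100
    if stride <= 0:
        raise ValueError("stride must be positive")
    total, start = 0, 0
    while True:
        end = min(start + chunk_tokens, doc_tokens)
        total += end - start
        if end == doc_tokens:
            return total
        start += stride


naive = naive_estimate(num_docs=5, avg_tokens=1000)
with_llm_rerank = tokens_per_query(5, 1000, 500, 10, 100)
with_embedding_rerank = tokens_per_query(5, 1000, 500, 0, 100)
overlap_40 = indexed_tokens(5000, 1000, 40)
overlap_10 = indexed_tokens(5000, 1000, 10)

print(f"naive estimate:           {naive}")
print(f"with LLM re-ranking:      {with_llm_rerank} ({with_llm_rerank / naive:.1f}x)")
print(f"with embedding re-rank:   {with_embedding_rerank}")
print(f"chunk tokens, 40% overlap: {overlap_40}")
print(f"chunk tokens, 10% overlap: {overlap_10}")
```

Under these assumptions the LLM re-ranking calls account for most of the gap between the naive estimate and the actual input tokens, while overlap mainly inflates the amount of duplicated text that is embedded and can be retrieved. Prompt caching is not modeled here, since its pricing depends on the provider.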
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:15:32.885876+00:00— report_created — created