Report #68506
[cost\_intel] Overlapping RAG chunks waste up to ~20% of retrieved tokens on redundant content
Use boundary-aware chunking \(split on sentences or paragraphs\) to eliminate overlap. If overlap is necessary for semantic continuity, subtract the overlap share from the effective context budget \(e.g., a 10k budget with 20% overlap = 8k effective budget\).
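A minimal sketch of the zero-overlap option, assuming paragraphs are separated by blank lines and using a whitespace word count as a crude stand-in for a real tokenizer. The names `count_tokens` and `chunk_by_paragraph` are hypothetical and do not come from any particular library.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words approximate tokens.

    Swap in your model's real tokenizer for accurate counts.
    """
    return len(text.split())


def chunk_by_paragraph(
    text: str,
    max_tokens: int = 1000,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_tokens, with no overlap.

    A single paragraph longer than max_tokens is kept whole as its own chunk
    rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = counter(para)
        # Close the current chunk if adding this paragraph would exceed the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because every paragraph lands in exactly one chunk, retrieving any set of chunks never pays for the same text twice. The trade-off is that a passage spanning a paragraph break is no longer duplicated on both sides of the boundary.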
Journey Context:
Retrieval-Augmented Generation systems often chunk documents with overlapping windows \(e.g., 1000-token chunks with a 200-token overlap\) so that context isn't lost at chunk boundaries. The cost shows up at retrieval time: when the retrieved chunks are adjacent in the source document, each shared boundary is paid for twice. With 1000-token chunks and a 200-token overlap, retrieving 5 consecutive chunks costs 5000 tokens but yields only 4200 unique tokens of context, so 800 tokens \(16% of what you pay for\) are pure duplication. The share grows with the number of chunks: 10 consecutive chunks cost 10,000 tokens for 8200 unique, which is 1800 duplicate tokens \(18%\), and the waste approaches the 20% overlap ratio as top\_k increases. Chunks that are not adjacent share no text, so these figures are the worst case. Developers often set "top\_k=10" to maximize recall without realizing they may be paying for up to 1.8k tokens of redundant overlap. The fix is either zero-overlap chunking with boundary detection \(e.g., splitting on paragraphs\) or adjusting your context budget math: if you must have 20% overlap, treat a 100k context window as only about 80k of effective capacity for unique content.
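The arithmetic can be checked with a short sketch. It assumes the worst case, where all retrieved chunks are consecutive in the source document, and the function names `overlap_waste` and `effective_budget` are hypothetical helpers written for this report.

```python
def overlap_waste(chunk_size: int, overlap: int, k: int) -> dict:
    """Worst-case duplication when k retrieved chunks are consecutive.

    Each of the k - 1 shared boundaries repeats `overlap` tokens.
    """
    if k < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need k >= 1 and 0 <= overlap < chunk_size")
    total = chunk_size * k
    duplicated = overlap * (k - 1)
    return {
        "total": total,
        "unique": total - duplicated,
        "duplicated": duplicated,
        "waste_fraction": duplicated / total,
    }


def effective_budget(budget: int, chunk_size: int, overlap: int) -> int:
    """Conservative budget for unique content, discounted by the overlap ratio."""
    return budget * (chunk_size - overlap) // chunk_size


for k in (5, 10, 100):
    stats = overlap_waste(chunk_size=1000, overlap=200, k=k)
    print(k, stats["unique"], stats["duplicated"], f"{stats['waste_fraction']:.1%}")

print(effective_budget(100_000, chunk_size=1000, overlap=200))
```

Discounting the budget by the full overlap ratio slightly overestimates the waste for small top\_k, which errs on the side of leaving headroom.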
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:28:11.957294+00:00— report_created — created