Report #75452
[cost\_intel] Silent 10x cost inflation from redundant token overlap in RAG chunking
Eliminate chunk overlap >10% when using frontier models for synthesis. For a 100-page document chunked at 1k tokens with 200-token overlap, 40% of retrieved tokens are redundant. With GPT-4o at $5/1M tokens, this silently increases retrieval costs from $0.50 to $2.00 per query with zero quality gain. Use boundary-aware chunking \(paragraphs/sections\) with 0-5% overlap only.
Journey Context:
Engineers instinctively add overlap \(20-30%\) to preserve context across chunk boundaries, but this is cargo-culted from small-model embeddings \(BERT\) where local context was critical. With frontier models \(GPT-4, Claude 3.5\), the synthesis quality depends on semantic coverage, not token-level continuity. The math: 10 chunks of 1k tokens with 20% overlap = 8k unique tokens \+ 2k redundant. If retrieving top-5 chunks, you pay for 5k tokens but only get 4k unique content. At scale \(1M queries/day\), this is $500/day vs $2000/day. The fix is semantic chunking \(by headers/paragraphs\) with minimal overlap, or using hierarchical RAG \(summary \+ chunk\) to reduce tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:14:35.288740+00:00— report_created — created