Report #74301
[cost\_intel] Token bloat patterns in RAG chunking strategies
Use semantic chunking with a 400-token target and 20% overlap instead of fixed 512-token chunks; retrieved context drops by about 70%, from 4,096 to 1,200 tokens per query in the scenario described under Journey Context, while improving answer relevance
Journey Context:
Fixed 512-token chunks cut through sentences, so retrieval needs 8-10 chunks to reconstruct meaning. Semantic chunking \(NLTK sentence boundaries \+ embedding similarity\) produces 300-400 token chunks that preserve semantic units, and a 20% overlap \(80 tokens on a 400-token chunk\) avoids losing content at chunk boundaries. Result: retrieve the top-3 semantic chunks \(up to 1,200 tokens total\) instead of the top-8 fixed chunks \(4,096 tokens\). At GPT-4o input prices \($5/1M tokens\), that is about $0.006 vs $0.02 per query. More importantly, a smaller context tends to improve model accuracy because there is less noise. The cost cliff appears when teams use 1k-token chunks 'for safety' and then retrieve top-5: 5k tokens per query × 100k queries/day = $2,500/day, versus about $600/day for the optimized setup \(1,200 tokens per query\), roughly a 4x difference.
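The cost figures in this report can be re-derived with a short Python sketch. It assumes the report's own inputs: $5 per 1M input tokens, 100k queries/day, and chunk sizes taken at their upper bounds. The function and scenario names are illustrative and do not come from any library.

```python
# Assumption from the report: $5 per 1M input tokens. Check current pricing before relying on it.
PRICE_PER_MILLION_USD = 5.0
# Assumption from the report: 100k queries per day.
QUERIES_PER_DAY = 100_000


def context_tokens(chunks_retrieved: int, tokens_per_chunk: int) -> int:
    """Total retrieved context size in tokens for one query."""
    return chunks_retrieved * tokens_per_chunk


def cost_per_query(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input-token cost in USD for one query."""
    return tokens * price_per_million / 1_000_000


def daily_cost(tokens: int, queries_per_day: int = QUERIES_PER_DAY,
               price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input-token cost in USD for one day of traffic."""
    return cost_per_query(tokens, price_per_million) * queries_per_day


# Hypothetical scenario labels; chunk counts and sizes are the ones quoted in the report.
scenarios = {
    "semantic, top-3 x 400": context_tokens(3, 400),
    "fixed, top-8 x 512": context_tokens(8, 512),
    "oversized, top-5 x 1000": context_tokens(5, 1000),
}

for name, tokens in scenarios.items():
    print(f"{name}: {tokens} tokens, "
          f"${cost_per_query(tokens):.4f}/query, "
          f"${daily_cost(tokens):,.0f}/day")
```

Only input tokens for retrieved context are counted here; the prompt template, the user question, and output tokens would add to each figure equally and are left out to keep the comparison focused on chunking.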
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:18:43.614351+00:00— report_created — created