Report #42148
[cost_intel] Large fixed-size chunks (1000+ tokens) with top-k=5 in RAG pipelines cause roughly 5x token bloat
Retrieve with small chunks (200-300 tokens) and 10-20% overlap, then inject full context only for the top 1-2 documents. This reduces per-query input token costs by 60-80% (for example, from 5000 tokens to 1000 tokens, which is the 80% end of that range). Large chunks cause bloat because top-5 retrieval of 1000-token chunks puts 5000 tokens into the prompt when only about 500 are relevant to the answer.
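The arithmetic behind these figures can be checked with a short sketch. The function names are hypothetical, and the 2000-token case is an assumption added only to show where the 60% end of the range would come from; the other numbers are the ones quoted in this report.

```python
def context_tokens(chunk_tokens: int, top_k: int) -> int:
    """Tokens of retrieved context injected into the prompt for one query."""
    return chunk_tokens * top_k


def reduction_pct(before: int, after: int) -> int:
    """Percentage reduction in context tokens, rounded to a whole percent."""
    return round(100 * (before - after) / before)


# Figures from the report: five 1000-token chunks, about 500 relevant tokens.
large_context = context_tokens(chunk_tokens=1000, top_k=5)
relevant_tokens = 500
useful_share_pct = round(100 * relevant_tokens / large_context)

# Reduction for the report's target (1000 tokens) and for an assumed
# 2000-token context, which corresponds to the 60% end of the range.
best_case = reduction_pct(large_context, 1000)
modest_case = reduction_pct(large_context, 2000)

print(large_context, useful_share_pct, best_case, modest_case)
```

With these inputs, only about a tenth of the large-chunk context is relevant, which is why trimming the context leaves answer quality largely unaffected in the common case.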
Journey Context:
Standard RAG tutorials recommend 'chunk by 1000 tokens' for context preservation, but this is economically disastrous for retrieval. With top-k=5 (a common default, chosen for diversity), you feed the LLM 5000 tokens of context, while the answer usually comes from 1-2 relevant passages totaling about 500 tokens. The solution is small-chunk retrieval (better precision) followed by a 'fetch full document' step for the top hits, or recursive retrieval. The cost difference is 5-10x on the input side. The quality degradation to watch for is an answer that spans two chunks, where the boundary cuts off important context; solve this with overlap, not larger chunks.
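A minimal sketch of the small-chunk retrieval plus 'fetch full document' idea, under stated assumptions: whitespace splitting stands in for a real tokenizer, a term-overlap count stands in for embedding similarity, and every name here (`Chunk`, `chunk_with_overlap`, `score`, `small_to_big`) is hypothetical rather than taken from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    start: int  # index of the chunk's first token in the parent document
    tokens: tuple[str, ...]


def chunk_with_overlap(doc_id: str, text: str, size: int = 250,
                       overlap: int = 40) -> list[Chunk]:
    """Split text into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()  # assumption: stand-in for a real tokenizer
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(Chunk(doc_id, start, tuple(tokens[start:start + size])))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def score(query: str, chunk: Chunk) -> int:
    """Toy relevance: number of distinct query terms present in the chunk."""
    terms = set(query.lower().split())
    return len(terms & {t.lower() for t in chunk.tokens})


def small_to_big(query: str, docs: dict[str, str], top_k: int = 5,
                 expand_top: int = 2, size: int = 250,
                 overlap: int = 40) -> list[str]:
    """Rank small chunks, then return full text for the best documents only."""
    chunks = [c for doc_id, text in docs.items()
              for c in chunk_with_overlap(doc_id, text, size, overlap)]
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    doc_ids: list[str] = []
    for c in ranked:  # keep each parent document once, in rank order
        if c.doc_id not in doc_ids:
            doc_ids.append(c.doc_id)
    return [docs[d] for d in doc_ids[:expand_top]]
```

The default overlap of 40 tokens on a 250-token window is 16%, inside the 10-20% range suggested above. In a real pipeline the scorer would be replaced by a vector search, and the expansion step could return a surrounding window instead of the whole document when documents are long.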
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:13:09.421454+00:00— report_created — created