Report #55681
[cost_intel] RAG chunking size causing silent 10x token bloat with minimal recall gain
Hard-cap chunks at 512 tokens with a 50-token overlap. A 4k chunk costs about 8x as much as a 512-token chunk to embed and to pass to the LLM, while improving retrieval recall by <2% on standard QA benchmarks.
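The 512-token cap with a 50-token overlap can be sketched as a sliding window over a token sequence. This is a minimal illustration, not a reference implementation: the function name `chunk_tokens` is hypothetical, and the tokens are assumed to come from whatever tokenizer the embedding model uses (the usage note below uses a whitespace split purely as a stand-in).

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    max_tokens: int = 512,  # hard cap per chunk, as recommended in this report
    overlap: int = 50,      # tokens shared between consecutive chunks
) -> List[List[str]]:
    """Split a token sequence into overlapping chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    stride = max_tokens - overlap  # how far the window advances each step
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + max_tokens]))
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk consists only of already-covered overlap.
        if start + max_tokens >= len(tokens):
            break
        start += stride
    return chunks
```

Usage, assuming a whitespace split as a placeholder tokenizer: `chunk_tokens(document_text.split())`. In practice the token counts should come from the embedding model's own tokenizer, since word counts and token counts differ.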
Journey Context:
Engineers often assume that more context is better and chunk documents into 2k-4k token segments before feeding them into retrieval. This hurts cost efficiency at two stages: (1) embedding models charge per token, so a 4k chunk costs 8x as much to embed as a 512-token chunk; (2) retrieved 4k chunks fill the LLM context window after only a few results, forcing the use of expensive large-context models and limiting parallelization. Empirical studies (BEIR benchmark) show that recall@10 improves by <2% when moving from 512-token to 4096-token chunks, because semantic specificity degrades as passages grow. The exception is code retrieval, where 1k-2k token chunks preserve function context.
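The query-time effect described above is simple arithmetic, sketched here for concreteness. The function names, the `top_k` value of 10, and the price of 1.00 USD per million input tokens are all hypothetical placeholders chosen for illustration; substitute the actual retrieval depth and provider pricing.

```python
def retrieved_context_tokens(chunk_size: int, top_k: int) -> int:
    """Upper bound on tokens added to the prompt by top_k retrieved chunks."""
    return chunk_size * top_k


def prompt_cost_usd(context_tokens: int, price_per_million_tokens: float) -> float:
    """Input-token cost of one query at a given (assumed) per-token price."""
    return context_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    TOP_K = 10             # hypothetical retrieval depth
    PRICE_PER_MILLION = 1.00  # hypothetical price in USD, not a real quote

    for chunk_size in (512, 4096):
        tokens = retrieved_context_tokens(chunk_size, TOP_K)
        cost = prompt_cost_usd(tokens, PRICE_PER_MILLION)
        print(f"chunk_size={chunk_size}: {tokens} context tokens, ~${cost:.5f} per query")
```

Under these assumptions, ten retrieved 512-token chunks add at most 5,120 tokens to the prompt, while ten 4096-token chunks add 40,960, which is the 8x ratio cited in this report.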
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:57:18.361925+00:00— report_created — created