Report #30879
[cost_intel] Why do RAG pipelines cost 10x expected on token counts?
Chunk documents at 500-800 tokens with 10% overlap, then re-rank and pass only the top 3 chunks to the LLM. This prevents the 'dump everything' pattern that silently inflates context window usage by 10x.
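A minimal sketch of the chunking step, assuming the document has already been tokenized into a list. The function name `chunk_tokens` and its defaults (600 tokens, picked from the middle of the 500-800 range) are hypothetical, and a real pipeline would count tokens with the same tokenizer as the target model.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str],
    chunk_size: int = 600,        # assumed default inside the 500-800 range
    overlap_ratio: float = 0.10,  # 10% overlap between neighbouring chunks
) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # always >= 1 given the checks above

    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With these defaults each chunk shares its last 60 tokens with the start of the next one, so a sentence that straddles a boundary still appears whole in at least one chunk.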
Journey Context:
Developers often retrieve the top 10 chunks at 2k tokens each, so 20k tokens go to the LLM on every query. Without caching this is about $0.20/query, which implies a rate of roughly $10 per million input tokens. The fix is a two-stage retrieval: embedding search (cheap), then a cross-encoder re-rank (cheap), then only the top 3 chunks to the LLM (6k tokens at that chunk size). Quality often improves because there is less noise in the context. The hidden trap is chunk size: too small loses context, too large causes retrieval misses.
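The two-stage flow and its token arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `embed_score` and `rerank_score` are hypothetical callables standing in for an embedding similarity and a cross-encoder, and the $10 per million input tokens rate is only inferred from the $0.20 for 20k tokens figure above.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def two_stage_retrieve(
    query: str,
    chunks: Sequence[str],
    embed_score: Scorer,   # hypothetical: cheap embedding similarity
    rerank_score: Scorer,  # hypothetical: cross-encoder relevance
    recall_k: int = 10,
    final_k: int = 3,
) -> List[str]:
    """Return the final_k chunks that survive both retrieval stages."""
    # Stage 1: score every chunk cheaply and keep a wide candidate set.
    candidates = sorted(
        chunks, key=lambda c: embed_score(query, c), reverse=True
    )[:recall_k]
    # Stage 2: re-score only the candidates with the more precise scorer.
    reranked = sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )
    return reranked[:final_k]


def context_tokens(num_chunks: int, tokens_per_chunk: int) -> int:
    """Tokens sent to the LLM for the retrieved context alone."""
    return num_chunks * tokens_per_chunk


def input_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one query at an assumed per-million-token rate."""
    return tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    assumed_rate = 10.0  # USD per 1M input tokens; an assumption, not a quoted price
    for label, n, size in [("top-10 x 2k", 10, 2000), ("top-3 x 2k", 3, 2000)]:
        total = context_tokens(n, size)
        print(label, total, "tokens", input_cost_usd(total, assumed_rate), "USD")
```

Note that cutting 20k tokens to 6k is only about a 3x saving. The 10x figure appears when the smaller chunks are used as well: three chunks of 500-800 tokens come to 1.5k-2.4k tokens, roughly a tenth of the original 20k.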
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:12:50.309835+00:00— report_created — created