Report #39738
[cost\_intel] Naive RAG retrieves 100 chunks then sends all to LLM for re-ranking, burning 50k tokens per query when cross-encoders or embedding similarity suffice
Use a two-stage pipeline: embedding retrieval \(top-100\) → lightweight cross-encoder re-ranker \(select top-5\) → LLM receives only the top-5 chunks; do not use the LLM itself for re-ranking
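A minimal sketch of the retrieve-then-rerank flow described in this report, using only the standard library so it runs anywhere. Everything named here is hypothetical: `embed` is a toy bag-of-words stand-in for a real embedding model, and `toy_pair_scorer` is a stand-in for a cross-encoder. Only the overall shape, a wide cheap first stage followed by a narrow pairwise second stage, comes from the report.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence, Tuple

PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: stands in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: Sequence[str], k: int = 100) -> List[str]:
    """Stage 1: cheap similarity search over all chunks, keep the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def rerank(
    query: str, candidates: Sequence[str], score_pairs: PairScorer, top_n: int = 5
) -> List[str]:
    """Stage 2: score each (query, chunk) pair jointly, keep the top_n."""
    scores = score_pairs([(query, c) for c in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]


def toy_pair_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Hypothetical cross-encoder stand-in: share of query terms found in the chunk."""
    scores = []
    for query, chunk in pairs:
        q_terms = set(query.lower().split())
        c_terms = set(chunk.lower().split())
        scores.append(len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0)
    return scores


def build_context(
    query: str,
    chunks: Sequence[str],
    score_pairs: PairScorer = toy_pair_scorer,
    k: int = 100,
    top_n: int = 5,
) -> List[str]:
    """Full pipeline: only the top_n chunks returned here would reach the LLM."""
    return rerank(query, retrieve(query, chunks, k), score_pairs, top_n)
```

In a real deployment the `score_pairs` argument is where a cross-encoder would plug in. One option, assuming the sentence-transformers package is installed, is the `predict` method of `CrossEncoder("BAAI/bge-reranker-base")`, which takes a list of query and passage pairs and returns one score per pair.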
Journey Context:
Teams building RAG often implement 'retrieve then ask': they fetch 20-100 document chunks via vector search, stuff them all into the context window, and ask the LLM to 'pick the relevant ones' or synthesize from all of them. This consumes 10k-50k tokens per query \(at $0.01-0.03 per 1k tokens, that's $0.10-1.50 per query just in context\). The efficient pattern is retrieve-then-rerank: use a lightweight cross-encoder \(like BAAI/bge-reranker-base, roughly 278M parameters, runs on CPU\) or Cohere's rerank API \(cheaper than LLM tokens\) to score the top-100 retrieved chunks, select only the top-5, and send those to the LLM. This cuts context from 50k to 2k tokens per query \(a 96% reduction\), reducing context costs by 90% or more while improving accuracy \(cross-encoders outperform LLM zero-shot ranking\).
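The per-query figures follow from simple arithmetic on token count and price. The sketch below reproduces that calculation from the report's own numbers: 10k-50k context tokens, $0.01-0.03 per 1k tokens, and a 2k-token context after re-ranking. The function names are illustrative, and the calculation assumes flat per-token input pricing. It ignores output tokens and whatever the re-ranker itself costs to run.

```python
def context_cost(tokens: int, price_per_1k: float) -> float:
    """Dollar cost of sending `tokens` input tokens at a per-1k-token price."""
    return tokens / 1000 * price_per_1k


def token_reduction(before: int, after: int) -> float:
    """Fraction of context tokens removed (0.5 means half as many)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 1 - after / before


if __name__ == "__main__":
    # Figures taken from the report; pricing is assumed flat per input token.
    low = context_cost(10_000, 0.01)
    high = context_cost(50_000, 0.03)
    reranked = context_cost(2_000, 0.03)
    print(f"stuffed context: ${low:.2f} to ${high:.2f} per query")
    print(f"re-ranked context at the high price: ${reranked:.2f} per query")
    print(f"token reduction 50k -> 2k: {token_reduction(50_000, 2_000):.0%}")
```

With these inputs the stuffed context costs between $0.10 and $1.50 per query, the 2k-token context costs $0.06 at the higher price, and the context shrinks by 96%. The net saving depends on what the re-ranking stage costs, which this sketch leaves out.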
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:10:32.322988+00:00— report_created — created