Report #65700
[cost\_intel] Sending full retrieved documents to the LLM instead of chunked relevant sections
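Re-rank and send only the top-3 chunks \(max 500 tokens each, at most 1,500 tokens total\) instead of the top-10 full documents \(avg 3,000 tokens each, about 30,000 tokens total\); this cuts context window usage by at least 80% \(about 95% with these exact figures\) and reduces 'lost in the middle' degradation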
Journey Context:
Naive RAG retrieves the top-k documents and stuffs them into the prompt in full. For a query that needs one specific fact from a 10-page PDF, sending the whole PDF consumes about 10k tokens when the relevant sentence is about 50 tokens. That is roughly 200x more input tokens than the answer requires, and it also degrades quality through attention dilution \(models tend to under-use content in the middle of a long context\). Solution: use an embedding retriever for coarse recall, then a cross-encoder re-ranker to select the specific sentences or chunks. Critical: ensure every chunk carries metadata \(source, page\) so the answer can cite it. Tradeoff: re-ranking adds ~100-200ms latency but saves $0.50-2.00 per query at scale.
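The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the reporter's implementation. Every name in it is hypothetical: `coarse_recall` uses bag-of-words cosine similarity as a stand-in for an embedding retriever, `toy_rerank_score` is a stand-in for a cross-encoder, and whitespace word counts only approximate model tokens.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # document the chunk came from, kept for citation
    page: int    # page within that document, kept for citation


def words(text: str) -> list[str]:
    """Lowercased alphanumeric words; a rough proxy for model tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def coarse_recall(query: str, chunks: list[Chunk], k: int = 10) -> list[Chunk]:
    """Stage 1. ASSUMPTION: bag-of-words cosine stands in for an embedding retriever."""
    q = Counter(words(query))
    return sorted(
        chunks, key=lambda c: cosine(q, Counter(words(c.text))), reverse=True
    )[:k]


def toy_rerank_score(query: str, chunk: Chunk) -> float:
    """Stage 2. ASSUMPTION: fraction of query terms found in the chunk stands in
    for a cross-encoder relevance score."""
    q = set(words(query))
    return len(q & set(words(chunk.text))) / len(q) if q else 0.0


def select_context(
    query: str,
    chunks: list[Chunk],
    score_fn: Callable[[str, Chunk], float] = toy_rerank_score,
    recall_k: int = 10,
    top_n: int = 3,
    max_tokens: int = 500,
) -> list[Chunk]:
    """Recall broadly, re-rank, keep top_n chunks, cap each at max_tokens words."""
    candidates = coarse_recall(query, chunks, k=recall_k)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return [
        Chunk(" ".join(c.text.split()[:max_tokens]), c.source, c.page)
        for c in ranked[:top_n]
    ]


def build_context(chunks: list[Chunk]) -> str:
    """Prefix each chunk with its source and page so the model can cite it."""
    return "\n".join(f"[{c.source}, p.{c.page}] {c.text}" for c in chunks)


corpus = [
    Chunk("The warranty period for the X200 pump is 24 months from delivery.", "manual.pdf", 7),
    Chunk("Install the pump on a level surface and tighten all bolts.", "manual.pdf", 3),
    Chunk("Our company was founded in 1987 in Rotterdam.", "about.pdf", 1),
    Chunk("Warranty claims must include the serial number of the pump.", "manual.pdf", 8),
    Chunk("Shipping usually takes five business days.", "faq.pdf", 2),
]
selected = select_context("What is the warranty period for the X200 pump?", corpus)
print(build_context(selected))
```

In a real pipeline, `score_fn` would wrap whichever cross-encoder is in use and `coarse_recall` would query a vector index. The sketch is only meant to show the shape: recall broadly, re-rank narrowly, cap chunk size, and keep source and page attached to whatever reaches the prompt.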
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:45:26.343855+00:00— report_created — created