Report #52570
[cost_intel] Retrieving top-10 documents for RAG context windows
RAG pipelines that retrieve 10 documents of 2k tokens each (20k tokens total) for frontier models show diminishing accuracy returns after 3-4 documents, because of 'lost in the middle' attention decay. Using a cheap reranker (Cohere Rerank-v3 or Haiku) to select the top 3 chunks from the top 20 retrieved, and feeding only those 3 to the frontier model, reduces input token costs by 70% with <2% quality degradation on QA benchmarks.
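A minimal sketch of the two-stage selection described above, using only the standard library. The scoring functions are toy stand-ins: `bow_cosine` stands in for vector search and `toy_reranker_score` for a real reranker such as Cohere Rerank or Haiku. All function names here are hypothetical and are not taken from any library.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List


def bow_cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over bag-of-words counts (stand-in for embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def toy_reranker_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query terms found in doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


def retrieve(query: str, corpus: List[str], k: int = 20) -> List[str]:
    """Stage 1: cheap, recall-oriented retrieval of the top-k candidates."""
    return sorted(corpus, key=lambda d: bow_cosine(query, d), reverse=True)[:k]


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_n: int = 3) -> List[str]:
    """Stage 2: precision-oriented reranking; keep only top_n chunks."""
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]


def build_context(query: str, corpus: List[str], k: int = 20, top_n: int = 3) -> List[str]:
    """Retrieve top-k, rerank, and return only the chunks sent to the frontier model."""
    return rerank(query, retrieve(query, corpus, k), toy_reranker_score, top_n)
```

In a real pipeline, `retrieve` would call the vector store and `score` would wrap the reranker's API; only the list returned by `build_context` would be placed in the frontier model's prompt.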
Journey Context:
The naive RAG formula is 'retrieve top_k=10, stuff everything into the context.' It stems from keyword-search intuition, where more results mean more recall. But LLMs attend unevenly across long contexts: content in the middle receives less attention than content near the start or end. Empirical studies show performance peaking at 3-5 relevant documents, then plateauing or dropping as noise increases. The cost impact is linear: 10 documents × 2k tokens × $3/MTok (Sonnet) = $0.06 per query for context alone. The fix is two-stage retrieve-then-rerank: run cheap vector search (top 20), then use a cheap reranker (a cross-encoder such as Cohere Rerank at $0.002/1k docs, or Haiku used as a reranker) to pick the best 3. Input tokens drop to 6k, which costs $0.018. The reranker cost is negligible. Quality often improves because the generator sees a high-precision top 3 rather than a high-recall but noisy top 10.
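The context-cost arithmetic above can be checked with a short sketch. It covers input context tokens only and uses the $3/MTok price quoted in the report; the function name `context_cost_usd` is hypothetical, and the reranker's own cost is left out because the report treats it as negligible.

```python
def context_cost_usd(num_docs: int, tokens_per_doc: int, price_per_mtok: float) -> float:
    """Input-token cost of the retrieved context for one query, in USD."""
    return num_docs * tokens_per_doc * price_per_mtok / 1_000_000


PRICE_PER_MTOK = 3.0  # USD per million input tokens, as quoted in the report

naive = context_cost_usd(10, 2_000, PRICE_PER_MTOK)    # top-10 stuffed into context
reranked = context_cost_usd(3, 2_000, PRICE_PER_MTOK)  # top-3 after reranking
savings = 1 - reranked / naive

print(f"naive:    ${naive:.3f} per query")
print(f"reranked: ${reranked:.3f} per query")
print(f"savings:  {savings:.0%}")
```

Because the cost is linear in token count, the 70% saving follows directly from cutting 20k context tokens to 6k, whatever the per-token price.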
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:44:08.249581+00:00— report_created — created