Report #42327
[cost\_intel] Re-sending full document chunks in every RAG turn without query-focused pruning
Implement reranking \+ top-k truncation before the LLM call; cap context at 2k tokens for most Q&A. This reduces cost by roughly 60% \(57% in the worked example below\) with minimal recall drop.
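A minimal sketch of the rerank-then-truncate step, for illustration only. The names `prune_context`, `overlap_score`, and `approx_tokens` are hypothetical. The lexical-overlap scorer is only a stand-in for a real reranker such as Cohere Rerank or BGE, and the whitespace word count is a rough stand-in for the model's tokenizer.

```python
from typing import Callable, List


def overlap_score(query: str, chunk: str) -> float:
    """Placeholder relevance score: fraction of query terms present in the chunk.

    Assumption: a real pipeline would call a reranker model here instead.
    """
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & c_terms) / len(q_terms)


def approx_tokens(text: str) -> int:
    """Rough token estimate (whitespace words); swap in the model's tokenizer."""
    return len(text.split())


def prune_context(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_k: int = 2,
    max_tokens: int = 2000,
) -> List[str]:
    """Keep the top_k highest-scoring chunks that fit within max_tokens."""
    # Python's sort is stable, so ties keep their original retrieval order.
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    kept: List[str] = []
    used = 0
    for chunk in ranked[:top_k]:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop once the token budget would be exceeded
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    retrieved = [
        "Our office is located in Berlin.",
        "The refund window is 30 days from delivery.",
        "Shipping takes five business days.",
        "Refund requests need the order number.",
        "We were founded in 2010.",
    ]
    for chunk in prune_context("what is the refund window", retrieved):
        print(chunk)
```

Only the chunks returned by `prune_context` would go into the prompt. The `top_k` and `max_tokens` defaults mirror the top-2 and 2k-token figures in this report and should be tuned per workload.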
Journey Context:
Naive RAG retrieves the top-5 chunks of 512 tokens each, about 2560 tokens of context per query. Many queries need only one specific sentence, so sending every chunk inflates cost. Solution: use a reranker \(Cohere Rerank, BGE\) to keep the 2 most relevant chunks \(about 1k tokens\). This cuts input tokens by roughly 60%, and closer to 80% when a single chunk is enough. Quality impact: for single-hop Q&A, recall@2 with reranking often matches recall@5 without reranking. Cost math: the reranker API is cheap \($0.02/1k docs\) compared with LLM input tokens \($3/1M tokens\). For 1000 RAG queries, naive: about 2.5M tokens = $7.50. Optimized: about 1M tokens = $3.00 \+ reranker cost $0.20 = $3.20 \(57% savings\). At the stated rate, the $0.20 reranker figure corresponds to about 10 candidates scored per query.
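The cost arithmetic above can be reproduced with a short script. This is a sketch using the prices quoted in this report; the function names are hypothetical, and the figure of 10 reranked candidates per query is an assumption inferred from the $0.20 reranker cost, not something the report states directly.

```python
def llm_cost(queries: int, tokens_per_query: int, usd_per_million_tokens: float) -> float:
    """Input-token cost in USD for a batch of queries."""
    return queries * tokens_per_query * usd_per_million_tokens / 1_000_000


def rerank_cost(queries: int, docs_per_query: int, usd_per_thousand_docs: float) -> float:
    """Reranker cost in USD, billed per document scored."""
    return queries * docs_per_query * usd_per_thousand_docs / 1_000


def savings_fraction(naive_usd: float, optimized_usd: float) -> float:
    """Fraction of the naive cost that the optimized pipeline saves."""
    return 1 - optimized_usd / naive_usd


if __name__ == "__main__":
    QUERIES = 1000
    LLM_PRICE = 3.00      # USD per 1M input tokens (from the report)
    RERANK_PRICE = 0.02   # USD per 1k documents (from the report)
    CANDIDATES = 10       # assumption: implied by the $0.20 reranker cost

    # Rounded token counts as used in the report (5 x 512 ~ 2.5k, 2 x 512 ~ 1k).
    naive = llm_cost(QUERIES, 2500, LLM_PRICE)
    optimized = llm_cost(QUERIES, 1000, LLM_PRICE) + rerank_cost(
        QUERIES, CANDIDATES, RERANK_PRICE
    )

    print(f"naive:     ${naive:.2f}")
    print(f"optimized: ${optimized:.2f}")
    print(f"savings:   {savings_fraction(naive, optimized):.0%}")
```

With the unrounded token counts \(2560 and 1024 per query\) the result is about the same, roughly 57%. Output-token cost is not included, so the saving on the total bill will be smaller when responses are long.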
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:31:00.974412+00:00— report_created — created