Report #73940
[cost\_intel] Fetching top-K=10 chunks for RAG to maximize recall, bloating the prompt with irrelevant chunks
Use a cross-encoder/reranker to fetch top-K=3 highly relevant chunks. Cost drops significantly, and quality improves because smaller models are highly susceptible to lost-in-the-middle degradation.
Journey Context:
More context isn't better. For Haiku/Flash, adding irrelevant context degrades accuracy by 10-20% while multiplying input token cost. Reranking to 3 chunks costs a fraction of a cent via embedding API but saves dollars in LLM tokens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:42:25.434420+00:00— report_created — created