Report #75735
[agent\_craft] Agent retrieves too many document chunks \(e.g., 20 chunks\) and fills the context window with redundant or low-relevance text, diluting the signal
Implement two-stage retrieval: fetch a large candidate set \(e.g., 100 chunks\), then use a reranker model \(cross-encoder\) to keep only the 3 to 5 most relevant chunks before injecting them into the prompt. Alternatively, use contextual compression to summarize chunks into short key points.
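A minimal sketch of the rerank stage, assuming the first stage has already returned a candidate list from the vector index. The names `rerank`, `ScoreFn`, and `overlap_score` are hypothetical and only for illustration. `overlap_score` is a toy lexical stand-in so the example runs without dependencies; it is not a cross-encoder. In a real setup the `score_fn` argument would wrap whichever reranker model is in use.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: takes (query, chunk) pairs, returns one relevance score per pair.
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: Sequence[str], score_fn: ScoreFn, top_n: int = 3) -> List[str]:
    """Stage 2: score every (query, chunk) pair and keep the top_n chunks."""
    if not candidates:
        return []
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_fn(pairs)
    # Sort by score only; Python's sort is stable, so ties keep retrieval order.
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]


def overlap_score(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a reranker: fraction of query tokens found in the chunk."""
    scores = []
    for query, chunk in pairs:
        q = set(query.lower().split())
        c = set(chunk.lower().split())
        scores.append(len(q & c) / len(q) if q else 0.0)
    return scores


# Stage 1 (not shown) would return a much larger candidate set from the vector index.
candidates = [
    "def parse_config(path): load yaml config from path",
    "utils for string formatting",
    "def retry(fn, attempts): retry a failing call with backoff",
    "utils for date handling",
]
top = rerank("how to retry a failing call", candidates, overlap_score, top_n=2)
```

Passing the scorer as a callable keeps the selection logic independent of any particular reranker, so the toy scorer can be swapped for a real model without touching `rerank`.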
Journey Context:
Simple RAG often uses top-k vector search with a fixed k \(e.g., 10\). With that many chunks in the prompt, the model runs into the 'Lost in the Middle' effect, where information placed mid-context is used less reliably, and the extra chunks add noise. The 'Contextual Compression' pattern \(from LangChain and research on reranking\) separates retrieval from presentation: retrieve broadly, then decide what the model actually sees. A cross-encoder reranker \(like Cohere Rerank or BGE-Reranker\) scores each query-chunk pair jointly; it is usually cheap compared to the LLM call and can substantially improve precision. This matters most for coding agents retrieving from large codebases where many files share similar embeddings \(e.g., multiple 'utils.py' files\). The compression step can use a cheap model \(e.g., GPT-3.5\) to summarize chunks into 'Key point: \[X\]' format before the main LLM sees them.
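The compression step could look roughly like the sketch below. `compress_chunks` and `first_sentence` are hypothetical names; `first_sentence` is a toy stand-in for the cheap summarizer model and simply keeps the first sentence of each chunk. The duplicate filter and the `max_points` cap are assumptions added for illustration, not part of the original report.

```python
from typing import Callable, Iterable, List


def compress_chunks(
    chunks: Iterable[str],
    summarize: Callable[[str], str],
    max_points: int = 5,
) -> str:
    """Turn retrieved chunks into a short 'Key point: ...' block for the main prompt."""
    points: List[str] = []
    seen = set()
    for chunk in chunks:
        point = summarize(chunk).strip()
        key = point.lower()
        if not point or key in seen:  # drop empty and duplicate summaries
            continue
        seen.add(key)
        points.append(f"Key point: {point}")
        if len(points) == max_points:
            break
    return "\n".join(points)


def first_sentence(text: str) -> str:
    """Toy stand-in for the cheap summarizer model: keep the first sentence only."""
    return text.split(". ")[0].rstrip(".")


chunks = [
    "retry() wraps a call with exponential backoff. It accepts an attempts argument.",
    "retry() wraps a call with exponential backoff. Defined in net/utils.py.",
    "parse_config() loads YAML from a path. Raises on missing keys.",
]
context = compress_chunks(chunks, first_sentence)
```

Here the two near-identical `retry()` chunks collapse into one key point, which is the kind of redundancy the report describes.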
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:42:47.566415+00:00— report_created — created