Report #6221
[architecture] Agent retrieves too many memories from the vector store and stuffs them all into the prompt, causing the LLM to get confused, ignore the actual task, and hallucinate
Cap retrieved memories to a strict budget (e.g., the top 3-5 chunks, within a fixed token limit) and use a reranking model (such as a cross-encoder) to filter the initial vector search results before injecting them into the context window.
Journey Context:
The assumption is that more context is better. In practice, LLMs suffer from attention dilution: if you retrieve 20 documents, the relevant facts get washed out by tangential ones. Vector search (bi-encoder) is fast but approximate, because the query and each memory are embedded independently. Reranking (cross-encoder) is slow but precise, because it scores the query and each candidate together. The optimal architecture is therefore two-stage retrieval: a fast vector search to get about 50 candidates, then a reranker to select the top 3 that the LLM actually sees. This maximizes signal while minimizing token cost and distraction.
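The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `two_stage_retrieve`, `count_tokens`, `overlap_score`, and `jaccard_score` are hypothetical names introduced here, and the two toy scorers are only stand-ins for a real bi-encoder similarity search and a real cross-encoder reranker. The whitespace token count is also an assumption; a real agent would use the tokenizer of its target model.

```python
from typing import Callable, List, Sequence

# A scorer takes (query, memory) and returns a relevance score (higher is better).
Scorer = Callable[[str, str], float]


def count_tokens(text: str) -> int:
    # Assumption: crude whitespace count, used only to illustrate the budget.
    return len(text.split())


def _words(text: str) -> set:
    return set(text.lower().split())


def overlap_score(query: str, memory: str) -> float:
    # Toy stand-in for the fast, approximate stage (vector search).
    return float(len(_words(query) & _words(memory)))


def jaccard_score(query: str, memory: str) -> float:
    # Toy stand-in for the slow, precise stage (cross-encoder reranker).
    q, m = _words(query), _words(memory)
    union = q | m
    return len(q & m) / len(union) if union else 0.0


def two_stage_retrieve(
    query: str,
    memories: Sequence[str],
    cheap_score: Scorer,
    rerank_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 3,
    token_budget: int = 200,
) -> List[str]:
    # Stage 1: cheap scoring over everything, keep a wide candidate pool.
    candidates = sorted(
        memories, key=lambda m: cheap_score(query, m), reverse=True
    )[:n_candidates]

    # Stage 2: expensive scoring over the small pool only.
    reranked = sorted(
        candidates, key=lambda m: rerank_score(query, m), reverse=True
    )

    # Enforce both caps: at most top_k chunks and at most token_budget tokens.
    selected: List[str] = []
    used = 0
    for memory in reranked:
        if len(selected) >= top_k:
            break
        cost = count_tokens(memory)
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(memory)
        used += cost
    return selected
```

Only the list returned by `two_stage_retrieve` would be injected into the prompt. Passing the scorers as parameters keeps the sketch independent of any particular embedding or reranking library, so the toy functions can be swapped for real models without changing the selection logic.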
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T23:36:31.551512+00:00— report_created — created