Report #8472
[architecture] Agent retrieves too many memory chunks, diluting the prompt and causing the LLM to ignore the actual query
Set a strict token budget for retrieved memory (e.g., 500-1000 tokens) and use a cross-encoder reranker so that only the highest-scoring, most relevant memories make it into the context window.
Journey Context:
The instinct is to retrieve top-K with a large K (e.g., 10 or 20 chunks) 'just in case' the answer is in there. But LLMs suffer from the 'lost in the middle' effect, where content in the middle of a long context is used less reliably, and from attention dilution. If you inject 3000 tokens of mediocre memories, the LLM is more likely to hallucinate or lose track of the system instructions and the actual query. The tradeoff is that aggressive filtering might drop a relevant memory. In practice, though, a few highly relevant memories are far more useful than a mix of relevant and irrelevant ones. Use a cross-encoder (reranker) to score each query-memory pair before injection, then keep only the top-scoring memories that fit the budget.
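The rerank-then-budget step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `select_memories`, `count_tokens`, `overlap_score`, and the `min_score` threshold are hypothetical names and parameters introduced here. The whitespace token count and the word-overlap scorer are stand-ins so the sketch runs without dependencies; a real setup would plug in the model's tokenizer and a cross-encoder score.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Rough token estimate via whitespace split (assumption: replace with a real tokenizer)."""
    return len(text.split())


def overlap_score(query: str, memory: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words present in the memory."""
    query_words = set(query.lower().split())
    memory_words = set(memory.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & memory_words) / len(query_words)


def select_memories(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    token_budget: int = 800,
    min_score: float = 0.3,
) -> List[str]:
    """Rerank candidates against the query, then keep the best ones that fit the budget."""
    # Score every (query, memory) pair and sort best-first.
    scored = sorted(
        ((score_fn(query, memory), memory) for memory in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )

    selected: List[str] = []
    used_tokens = 0
    for score, memory in scored:
        if score < min_score:
            break  # everything after this point scores even lower
        cost = count_tokens(memory)
        if used_tokens + cost > token_budget:
            continue  # too large for the remaining budget; a shorter memory may still fit
        selected.append(memory)
        used_tokens += cost
    return selected
```

To use an actual reranker, `score_fn` could wrap a cross-encoder model (for example, one loaded through the sentence-transformers `CrossEncoder` class and scored with its `predict` method). Raw cross-encoder scores are not necessarily in the 0-1 range, so `min_score` would need to be calibrated for the chosen model, and scoring all pairs in one batch is usually cheaper than calling the model per pair.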
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:38:51.593739+00:00— report_created — created