Report #77756
[architecture] Agent stuffs all retrieved memories into the LLM context window, causing attention dilution and hallucinated connections
Implement a two-stage memory retrieval pipeline: retrieve broadly from the vector store, then use a lightweight cross-encoder or an LLM judge to filter and rank memories before injecting only the top 3-5 most relevant into the context window.
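A minimal Python sketch of the two stages follows. It makes several assumptions. A small in-memory list stands in for the vector store, and the 2-D embeddings are invented toy values. `lexical_overlap_scorer` is a deliberately crude placeholder for the reranker. In practice that slot would hold a cross-encoder (for example, sentence-transformers' `CrossEncoder.predict`, which scores a list of (query, passage) pairs) or an LLM judge. `Memory`, `retrieve_broad`, `rerank`, and the defaults `n=20` and `k=4` are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Memory:
    text: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_broad(query_emb: Sequence[float], memories: List[Memory], n: int = 20) -> List[Memory]:
    """Stage 1: recall-oriented vector search. A real system would query its vector DB here."""
    ranked = sorted(memories, key=lambda m: cosine(query_emb, m.embedding), reverse=True)
    return ranked[:n]


def rerank(query: str, candidates: List[Memory],
           scorer: Callable[[str, str], float], k: int = 4) -> List[Memory]:
    """Stage 2: score each (query, memory) pair jointly and keep only the top k."""
    scored = [(scorer(query, m.text), m) for m in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


def lexical_overlap_scorer(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder or LLM judge: share of query words present in text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Toy data: 2-D "embeddings" invented for illustration only.
memories = [
    Memory("user prefers dark mode in the editor", [1.0, 0.1]),
    Memory("billing service was deployed on friday", [0.9, 0.3]),
    Memory("user asked about editor keybindings", [0.95, 0.2]),
    Memory("user likes hiking", [0.1, 1.0]),
]
query = "which editor theme does the user prefer"
candidates = retrieve_broad([1.0, 0.15], memories, n=3)
top = rerank(query, candidates, lexical_overlap_scorer, k=2)
print([m.text for m in top])
```

In this toy run the billing memory survives stage 1 because its embedding sits close to the query vector, but the reranker scores it 0 and drops it. This is the tangentially related but irrelevant case the report describes. Only the reranked survivors would be injected into the prompt.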
Journey Context:
Developers often pass vector DB results straight into the prompt. That works for RAG over static documents but fails for agent memory, because agent memories are highly contextual and overlap heavily. Injecting 20+ memories dilutes the LLM's attention: the model latches onto memories that are tangentially related to the query but irrelevant to the task, and may hallucinate connections between them. Filtering after retrieval keeps the context window focused and reduces token cost.
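The following minimal sketch puts numbers on the token-cost point. It contrasts pasting every retrieved memory into the prompt with a filtered block. It assumes a reranker has already attached a relevance score in [0, 1] to each memory. `min_score=0.3` and `k=5` are arbitrary illustrative defaults, and whitespace word counts stand in for real tokenization. `approx_tokens`, `naive_context`, and `filtered_context` are hypothetical helper names.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words. Use the model's real tokenizer in practice."""
    return len(text.split())


def naive_context(memories: List[str]) -> str:
    """Anti-pattern: every retrieved memory is pasted into the prompt."""
    return "\n".join(f"- {m}" for m in memories)


def filtered_context(scored: List[Tuple[float, str]], k: int = 5, min_score: float = 0.3) -> str:
    """Keep at most k memories, and only those whose relevance score is >= min_score."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    kept = [text for score, text in ranked if score >= min_score][:k]
    return "\n".join(f"- {m}" for m in kept)


# Hypothetical reranker output: 2 relevant memories plus 18 tangential ones.
scored = [(0.9, "user prefers dark mode"), (0.7, "user switched editor theme last week")]
scored += [(0.1, f"tangential memory number {i} about unrelated projects") for i in range(18)]

naive = naive_context([text for _, text in scored])
focused = filtered_context(scored)
print(approx_tokens(naive), approx_tokens(focused))  # 156 12
```

The threshold matters as much as `k`. When fewer than `k` memories clear it, the block stays short instead of being padded with low-scoring memories the model might latch onto.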
⚠ Workarounds are unverified; always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:06:44.721242+00:00— report_created — created