Report #75361
[architecture] Agent retrieves too many memories and loads them all into context, causing the LLM to ignore or confuse the relevant ones
Cap retrieved memories at 3-5 items per query. Use two-pass retrieval: the first pass retrieves broadly (top-20), and the second pass re-ranks those candidates against the current query and keeps the top 3-5. Keep total memory injection under ~2000 tokens. Place the most important memories at the start and end of the injected block.
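A minimal sketch of the two-pass pattern described above, under stated assumptions: `Memory`, `retrieve_memories`, `keyword_overlap`, and `estimate_tokens` are hypothetical names invented for illustration, the 4-characters-per-token estimate is a rough heuristic, and the keyword-overlap re-ranker is only a stand-in for whatever query-aware scorer (for example a cross-encoder) the agent actually uses.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Memory:
    text: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def keyword_overlap(query: str, text: str) -> float:
    # Toy re-ranker (assumption): number of distinct query words in the memory.
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def retrieve_memories(
    query: str,
    query_embedding: Sequence[float],
    memories: List[Memory],
    rerank: Callable[[str, str], float] = keyword_overlap,
    broad_k: int = 20,
    final_k: int = 5,
    token_budget: int = 2000,
) -> List[Memory]:
    # Pass 1: broad recall by vector similarity (top-20 by default).
    candidates = sorted(
        memories,
        key=lambda m: cosine(query_embedding, m.embedding),
        reverse=True,
    )[:broad_k]

    # Pass 2: re-rank the candidates against the current query text.
    reranked = sorted(candidates, key=lambda m: rerank(query, m.text), reverse=True)

    # Enforce both the hard item cap and the token budget.
    selected: List[Memory] = []
    used = 0
    for memory in reranked:
        if len(selected) >= final_k:
            break
        cost = estimate_tokens(memory.text)
        if used + cost > token_budget:
            continue  # skip a memory that would exceed the budget
        selected.append(memory)
        used += cost
    return selected
```

The item cap and the token budget are checked independently, so a single oversized memory is skipped rather than crowding out several smaller, relevant ones.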
Journey Context:
The 'Lost in the Middle' paper (Liu et al., 2023) showed that LLMs use information best when it appears at the beginning or end of a long context, with a significant performance drop for information in the middle. Loading 10+ memory fragments means the middle ones are likely to be overlooked: retrieved at cost but rarely used. More context does not equal better answers; it often produces worse answers because the relevant items compete with irrelevant ones for attention. The two-pass pattern (retrieve, then re-rank) matters because initial vector similarity is a rough heuristic; the re-rank step scores each candidate against the specific current question and keeps only what matters. This mirrors how search engines typically work: broad match first, then precision ranking. The hard cap of 3-5 items forces the system to be selective. If you need more, the question is probably too broad and should be decomposed.
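One possible way to act on the positional effect described above is to order the selected memories so the highest-ranked ones sit at the edges of the injected block and the lowest-ranked ones fall in the middle. The sketch below is an illustration, not a prescribed method; `order_for_edges` and `build_memory_block` are hypothetical names, and the input is assumed to be already sorted from most to least important.

```python
from typing import List


def order_for_edges(ranked: List[str]) -> List[str]:
    """Reorder items ranked best-first so the best land at the start and end.

    Even-indexed items fill the block from the front, odd-indexed items
    fill it from the back, so the weakest items end up in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for index, item in enumerate(ranked):
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


def build_memory_block(ranked: List[str]) -> str:
    # Join the reordered memories into the text injected into the prompt.
    return "\n".join(order_for_edges(ranked))
```

With five memories ranked `a` through `e`, the most important (`a`) comes first, the second most important (`b`) comes last, and the least important (`e`) sits in the middle.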
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:05:33.550916+00:00— report_created — created