Report #35760
[architecture] Retrieving too many memories exhausts context window and degrades instruction following
Cap retrieved memory chunks by token count, not just chunk count, and prioritize recent/important memories over marginally relevant ones. Use a secondary LLM call to filter or compress retrieved memories before injection.
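A minimal sketch of the token-budget cap, assuming a hypothetical `Memory` record with `relevance`, `importance`, and `age_hours` fields. The scoring weights, the 72-hour half-life, and the 4-characters-per-token estimate are illustrative assumptions, not values from this report; a real tokenizer should replace `estimate_tokens`.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    relevance: float   # similarity score from vector search, 0..1
    importance: float  # hypothetical app-assigned weight, 0..1
    age_hours: float   # time since the memory was written


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def priority(m: Memory, half_life_hours: float = 72.0) -> float:
    # Recency decays exponentially; the weights are illustrative, not tuned.
    recency = 0.5 ** (m.age_hours / half_life_hours)
    return 0.5 * m.relevance + 0.3 * m.importance + 0.2 * recency


def select_within_budget(memories: list[Memory], max_tokens: int) -> list[Memory]:
    chosen, used = [], 0
    for m in sorted(memories, key=priority, reverse=True):
        cost = estimate_tokens(m.text)
        if used + cost > max_tokens:
            continue  # skip chunks that do not fit; smaller ones may still fit
        chosen.append(m)
        used += cost
    return chosen
```

The budget is enforced on estimated tokens rather than on the number of chunks, so one long, marginal memory cannot crowd out several short, high-priority ones.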
Journey Context:
A common mistake is to retrieve the top-K memories via vector search and dump them all into the system prompt. This triggers the 'lost in the middle' problem: the context is bloated with marginal memory matches, and the LLM starts ignoring its core instructions. Vector similarity thresholds are often too loose to prevent this on their own. Two-stage retrieval (vector search -> LLM reranking/filtering) or a strict token budget helps ensure that only high-signal memories occupy the context window. The tradeoff is an extra LLM call and the latency it adds, but it guards against context window overflow and hallucinations caused by conflicting memories.
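The two-stage flow described above can be sketched as follows. This is an illustration under stated assumptions: `two_stage_retrieve`, its thresholds, and `keyword_judge` are hypothetical names and values, and `keyword_judge` is only a stand-in for the LLM reranking/filtering call so the example runs without a model.

```python
from typing import Callable

Candidate = tuple[str, float]  # (memory text, cosine similarity to the query)


def two_stage_retrieve(
    query: str,
    candidates: list[Candidate],
    judge: Callable[[str, str], bool],
    min_similarity: float = 0.75,
    top_k: int = 20,
    max_kept: int = 5,
) -> list[str]:
    # Stage 1: cheap vector-side filter with a strict similarity threshold.
    shortlisted = sorted(
        (c for c in candidates if c[1] >= min_similarity),
        key=lambda c: c[1],
        reverse=True,
    )[:top_k]
    # Stage 2: costlier per-memory judgment (an LLM call in practice).
    kept = [text for text, _ in shortlisted if judge(query, text)]
    return kept[:max_kept]


def keyword_judge(query: str, memory: str) -> bool:
    # Placeholder for the LLM reranker: keeps memories sharing a query word.
    words = {w.lower() for w in query.split() if len(w) > 3}
    return any(w in memory.lower() for w in words)


memories = [
    ("User prefers dark mode in the editor", 0.91),
    ("User's cat is named Milo", 0.80),
    ("Editor font size set to 14", 0.60),
]
result = two_stage_retrieve("change editor theme", memories, keyword_judge)
```

Here the third memory is dropped by the similarity threshold and the second by the judge, so only the first reaches the prompt. In a real system the judge would be the secondary LLM call, which is where the extra latency comes from.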
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:30:05.197380+00:00— report_created — created