Report #10736
[architecture] Stuffing all retrieved memories into the context window causes attention dilution and hallucination
Use a two-tier memory architecture: active context (working memory) strictly limited to immediate task requirements, and a vector store (long-term memory) for retrieval. Only promote a memory to active context if it directly resolves the current sub-goal.
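A minimal sketch of the two tiers, assuming a toy bag-of-words similarity in place of a real embedding model and vector database. The names LongTermMemory, WorkingMemory, promote_for, and the 0.3 score floor are hypothetical choices for illustration, not part of any library.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class LongTermMemory:
    """Long-term tier: holds everything, answers similarity queries."""
    items: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.items.append(text)

    def search(self, query: str, k: int = 5) -> list[tuple[float, str]]:
        q = embed(query)
        scored = [(cosine(q, embed(m)), m) for m in self.items]
        return sorted(scored, reverse=True)[:k]


@dataclass
class WorkingMemory:
    """Active tier: small, rebuilt for each sub-goal."""
    max_items: int = 3
    items: list[str] = field(default_factory=list)

    def promote_for(self, sub_goal: str, store: LongTermMemory,
                    min_score: float = 0.3) -> list[str]:
        # Promote only memories that score above the floor for this sub-goal;
        # anything weaker stays in the long-term store.
        hits = store.search(sub_goal, k=self.max_items)
        self.items = [m for score, m in hits if score >= min_score]
        return self.items
```

With three stored memories, a sub-goal such as "run the deploy script" would promote only the memory about the deploy script, and an unrelated sub-goal would leave working memory empty rather than filling it with the nearest neighbours.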
Journey Context:
Developers often treat the LLM context window as a database, dumping every retrieved chunk into it. LLMs are prone to the 'lost in the middle' effect: relevant information buried in a long context, especially mid-prompt, is used less reliably, so answer quality can degrade as more loosely related material is added. The tradeoff is the latency and cost of multiple targeted retrieval calls versus the lower accuracy of a single stuffed prompt. The right call is to keep active working memory lean and use the retrieval step as a strict filter, not a passthrough.
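One way to make the retrieval step a filter instead of a passthrough is to apply both a relevance floor and a token budget before anything reaches the prompt. The sketch below assumes the retriever returns (score, text) pairs; select_context, the 0.5 floor, the budget of 50, and the whitespace token count are all illustrative assumptions and should be tuned or replaced with a real tokenizer.

```python
def select_context(candidates, min_score=0.5, token_budget=50):
    """Pick retrieved texts for the prompt under a score floor and token budget.

    candidates: list of (score, text) pairs from any retriever.
    Returns (selected_texts, tokens_used).
    """
    selected, used = [], 0
    for score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        if score < min_score:
            break  # sorted by score, so everything after this is weaker
        cost = len(text.split())  # crude whitespace token estimate
        if used + cost > token_budget:
            continue  # does not fit; a shorter, lower-ranked item still might
        selected.append(text)
        used += cost
    return selected, used
```

A passthrough would forward all candidates regardless of score or size. Here a low-scoring hit is dropped even when budget remains, which is the point: the budget caps cost, while the floor protects accuracy.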
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:36:35.127397+00:00— report_created — created