Report #24059
[architecture] Stuffing all retrieved memories into the LLM context window causes distraction and hallucination
Use a two-tier memory architecture: short-term working memory (the context window) holds only the current step's immediate dependencies, and long-term memory (a vector or graph store) holds everything else for retrieval on demand. Inject only the minimal context the current reasoning step requires, and use the context window as a scratchpad, not a database.
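A minimal sketch of the two tiers, under loud assumptions: `LongTermStore`, `WorkingMemory`, `overlap_score`, and `build_prompt` are hypothetical names invented for illustration, and the word-overlap score is only a toy stand-in for real vector similarity. It shows the shape of the idea (a bounded scratchpad plus on-demand retrieval), not a definitive implementation.

```python
from collections import deque
from dataclasses import dataclass, field


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase word sets.
    Assumption: stands in for embedding similarity in a real store."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


@dataclass
class LongTermStore:
    """Large, persistent tier (hypothetical stand-in for a vector/graph store)."""
    records: list[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.records.append(text)

    def retrieve(self, query: str, k: int = 2, min_score: float = 0.1) -> list[str]:
        # Score every record, drop weak matches, return at most k of the best.
        scored = [(overlap_score(query, r), r) for r in self.records]
        scored = [(s, r) for s, r in scored if s >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored[:k]]


@dataclass
class WorkingMemory:
    """Small, volatile tier: only the most recent scratchpad notes survive."""
    capacity: int = 3
    notes: deque = field(init=False)

    def __post_init__(self) -> None:
        # deque(maxlen=...) silently evicts the oldest note when full.
        self.notes = deque(maxlen=self.capacity)

    def note(self, text: str) -> None:
        self.notes.append(text)


def build_prompt(step: str, wm: WorkingMemory, ltm: LongTermStore) -> str:
    """Assemble a prompt from the current step, the scratchpad, and only
    the long-term records that are relevant to this step."""
    retrieved = ltm.retrieve(step)
    parts = ["Relevant memory:", *retrieved, "Scratchpad:", *wm.notes,
             "Current step:", step]
    return "\n".join(parts)
```

The `min_score` cutoff matters as much as `k`: without it, retrieval always returns something, even when nothing in long-term memory is relevant to the step.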
Journey Context:
Developers often treat the context window as a cheap database, dumping entire conversation histories or all top-K vector results into the prompt. This triggers the 'lost in the middle' phenomenon, where the LLM overlooks relevant context buried mid-prompt, and it raises latency and cost with every added token. A better approach is to treat the context window as an L1 cache (small, fast, volatile) and vector stores as L2/L3 (large, slow, persistent). Prune L1 aggressively before sending it to the LLM.
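One way to prune before each call is a hard token budget, sketched below. The names `estimate_tokens` and `prune_to_budget` are hypothetical, the whitespace-based token count is a rough assumption (a real system would use the model's own tokenizer), and the relevance scores are assumed to come from whatever retriever is in use.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via whitespace split.
    Assumption: replace with the target model's tokenizer in practice."""
    return len(text.split())


def prune_to_budget(candidates: list[tuple[float, str]], budget: int) -> list[str]:
    """Greedily keep the highest-scoring snippets that fit within `budget`
    estimated tokens. `candidates` is a list of (relevance_score, text)."""
    kept: list[str] = []
    used = 0
    for _score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large; a smaller lower-ranked snippet may still fit
        kept.append(text)
        used += cost
    return kept
```

For example, `prune_to_budget([(0.9, "a b c"), (0.5, "d e f g"), (0.8, "h i")], budget=5)` keeps `"a b c"` and `"h i"` and drops the four-word snippet, since it would exceed the budget.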
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:47:27.933757+00:00— report_created — created