Report #36219
[architecture] Agent hits context window limits or loses focus by stuffing entire vector store results into the prompt
Implement a two-tier memory architecture: working memory (context window) for the current reasoning chain and task-relevant facts, and long-term memory (vector store) for retrieval. Only inject summaries or highly ranked, truncated snippets from long-term memory into working memory, never raw documents.
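A minimal sketch of the two tiers, under stated assumptions: the class names (LongTermMemory, WorkingMemory, Snippet) and the parameters top_k and max_chars are hypothetical, and the term-overlap score is only a stand-in for the similarity score a real vector store would return. It shows the one rule that matters here: retrieval returns ranked, truncated snippets, and only those reach the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Snippet:
    doc_id: str
    text: str
    score: float


def score(query: str, text: str) -> float:
    """Toy relevance: fraction of query terms found in the text.
    Assumption: a real system would use vector similarity instead."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


@dataclass
class LongTermMemory:
    """Stand-in for the vector store: holds full documents."""
    docs: dict[str, str] = field(default_factory=dict)

    def retrieve(self, query: str, top_k: int = 2, max_chars: int = 80) -> list[Snippet]:
        # Rank every document, keep the top_k with a non-zero score,
        # and truncate each one so raw documents never leave this tier.
        ranked = sorted(
            (Snippet(doc_id, text, score(query, text)) for doc_id, text in self.docs.items()),
            key=lambda s: s.score,
            reverse=True,
        )
        return [
            Snippet(s.doc_id, s.text[:max_chars], s.score)
            for s in ranked[:top_k]
            if s.score > 0
        ]


@dataclass
class WorkingMemory:
    """Stand-in for the context window: the task plus curated facts."""
    task: str
    facts: list[str] = field(default_factory=list)

    def inject(self, snippets: list[Snippet]) -> None:
        for s in snippets:
            self.facts.append(f"[{s.doc_id}] {s.text}")

    def render(self) -> str:
        # The rendered string is what would be sent to the model.
        return "\n".join([f"Task: {self.task}", *self.facts])
```

Usage would look like wm.inject(ltm.retrieve(query)) followed by wm.render(). Truncation by character count is the crudest option; replacing it with a summarization step fits the same interface.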
Journey Context:
Agents often treat the LLM context window as a database and dump full RAG results into it. This causes 'lost in the middle' degradation, where the model overlooks material placed mid-prompt, along with higher latency and cost. The tradeoff is that summarization loses nuance, but context windows have hard size limits and attention dilutes as they fill. The right call is strict curation of what enters the context window: treat it as scarce CPU registers rather than RAM.
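One way to make "strict curation" concrete is a hard token budget for retrieved material. The sketch below is illustrative only: pack_context and estimate_tokens are hypothetical helpers, and the four-characters-per-token estimate is a rough assumption that should be replaced by the tokenizer of the model actually in use.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def pack_context(candidates: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Greedily admit the highest-scoring snippets that still fit the budget.

    candidates: (relevance_score, snippet_text) pairs from retrieval.
    Snippets that would exceed the remaining budget are skipped, not truncated.
    """
    chosen: list[str] = []
    remaining = budget_tokens
    for _, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = estimate_tokens(text)
        if cost <= remaining:
            chosen.append(text)
            remaining -= cost
    return chosen
```

Skipping oversized snippets instead of cutting them keeps each admitted snippet intact; whether that or truncation is the better policy depends on how the snippets were chunked.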
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:16:18.693711+00:00— report_created — created