Report #90088
[architecture] Agent context window overflow from dumping all memories
Implement a two-tier memory architecture: use a vector store (long-term) for retrieval and limit the active context window (working memory) to only the top-K relevant facts plus recent conversation. Set a hard token budget for injected memories.
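A minimal sketch of the selection step, assuming an in-memory list stands in for the vector store and a whitespace split stands in for a real tokenizer. The names `Memory`, `select_memories`, `estimate_tokens`, `top_k`, and `token_budget` are hypothetical and chosen for illustration only; a real setup would query the vector database and count tokens with the model's own tokenizer.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_tokens(text: str) -> int:
    # Assumption: crude whitespace count, a placeholder for a real tokenizer.
    return len(text.split())


def select_memories(query_embedding: list[float], store: list[Memory],
                    top_k: int = 5, token_budget: int = 200) -> list[str]:
    """Return at most top_k memory texts whose total estimated size fits the budget."""
    ranked = sorted(store,
                    key=lambda m: cosine(query_embedding, m.embedding),
                    reverse=True)
    selected, used = [], 0
    for memory in ranked[:top_k]:
        cost = estimate_tokens(memory.text)
        if used + cost > token_budget:
            continue  # skip a memory that would exceed the hard budget
        selected.append(memory.text)
        used += cost
    return selected


store = [
    Memory("user prefers dark mode", [1.0, 0.0]),
    Memory("user lives in Berlin and works remotely most days", [0.9, 0.1]),
    Memory("unrelated fact", [0.0, 1.0]),
]
picked = select_memories([1.0, 0.0], store, top_k=2, token_budget=5)
```

Here the second-ranked memory is relevant but too large for the remaining budget, so it is dropped rather than truncated. Whether to skip, truncate, or summarize an oversized memory is a design choice the report does not prescribe.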
Journey Context:
Developers often treat the LLM context window as the primary database, stuffing it with full chat histories or hundreds of retrieved chunks. This dilutes the model's attention, so it ignores instructions or hallucinates, and the prompt eventually hits hard token limits. The tradeoff is latency/cost vs. accuracy: retrieving too little misses context, while retrieving too much degrades reasoning. Working memory (the context window) must be kept lean and act as a scratchpad, while long-term memory (the vector DB) acts as the archival system.
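The same budget idea applies to the "recent conversation" half of working memory. The sketch below keeps only the newest turns that fit a token budget instead of replaying the full chat history. The function name `trim_history` and the whitespace-based token count are assumptions for illustration, not part of the report.

```python
def trim_history(turns: list[str], token_budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimated size fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = len(turn.split())  # assumption: crude whitespace token estimate
        if used + cost > token_budget:
            break  # older turns are left to long-term memory
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order for the prompt
    return kept


history = ["a b c", "d e", "f g h i"]
recent = trim_history(history, token_budget=6)
```

Turns that fall outside the window are not lost in this design; they would be written to the vector store and come back only if retrieval ranks them as relevant.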
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:48:34.465215+00:00— report_created — created