Report #73856
[synthesis] Long-running agent tasks degrade because context windows fill with irrelevant history, causing the model to lose track of the current task and hallucinate
Treat the context window as managed working memory with explicit eviction and summarization policies. Implement three tiers: (1) a recency-weighted buffer that keeps the last N actions in full fidelity, (2) a semantic retrieval layer that fetches relevant historical context on demand, and (3) a compact rolling summary of earlier actions that is regenerated as context grows. Never dump the full conversation history into context; curate it the way an operating system manages memory.
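A minimal sketch of the three tiers, assuming actions are plain strings. The names `TieredContext`, `naive_summarize`, `record`, `retrieve`, and `build` are hypothetical and not taken from any tool. Word overlap stands in for real semantic retrieval, and the summarizer is a placeholder where a model call would normally go.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


def naive_summarize(previous_summary: str, evicted: str) -> str:
    """Placeholder summarizer: keeps the first 8 words of each evicted action.

    A real system would likely call a model here; this stand-in is an assumption.
    """
    short = " ".join(evicted.split()[:8])
    return (previous_summary + " | " + short) if previous_summary else short


@dataclass
class TieredContext:
    task: str                      # original instruction, always kept verbatim
    recent_limit: int = 3          # N: number of actions kept in full fidelity
    summarize: Callable[[str, str], str] = naive_summarize
    recent: Deque[str] = field(default_factory=deque)   # tier 1
    archive: List[str] = field(default_factory=list)    # tier 2 search corpus
    summary: str = ""                                   # tier 3

    def record(self, action: str) -> None:
        """Add an action; evict the oldest into archive and summary when over N."""
        self.recent.append(action)
        while len(self.recent) > self.recent_limit:
            evicted = self.recent.popleft()
            self.archive.append(evicted)
            self.summary = self.summarize(self.summary, evicted)

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        """Tier 2: word-overlap scoring as a stand-in for embedding similarity."""
        q = set(query.lower().split())
        scored = [(len(q & set(a.lower().split())), a) for a in self.archive]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [a for _, a in ranked[:k]]

    def build(self, query: str) -> str:
        """Assemble the prompt: task, summary, recalled history, recent actions."""
        parts = ["TASK: " + self.task]
        if self.summary:
            parts.append("SUMMARY: " + self.summary)
        parts += ["RECALLED: " + a for a in self.retrieve(query)]
        parts += ["RECENT: " + a for a in self.recent]
        return "\n".join(parts)
```

Calling `record` after every agent step and `build` before every model call keeps the prompt size roughly constant: only `recent_limit` actions appear verbatim, and everything older is reachable only through the summary or an explicit retrieval query.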
Journey Context:
The naive approach of including the full conversation history works for short tasks but degrades badly in long agent sessions. Research on long-context LLMs ('Lost in the Middle') shows that models use information in the middle of a long context much less reliably than information near the beginning or end, and that this holds even for models built for long contexts. Cursor's architecture manages context carefully: recent edits get full representation, older context gets summarized, and irrelevant files are excluded from the window. Windsurf's Cascade system explicitly manages a context budget with relevance scoring. The synthesis: the context window is not a log. It is working memory, and like any memory system it needs an eviction policy. The common mistake is treating context as free, when in fact every token has a cost in attention quality and inference latency. The right approach borrows from OS memory management: a working set (recent, high-fidelity), a cache (semantic retrieval on demand), and a backing store (summarized history). The tradeoff: aggressive summarization loses detail, but full inclusion degrades reasoning. Err on the side of summarizing older context aggressively while keeping the most recent actions and the original task instruction verbatim.
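The priority order described above can be sketched as a budget-packing step, under some loud assumptions. `pack_context` and `count_tokens` are hypothetical names. Tokens are approximated by whitespace-separated words. Truncating the summary stands in for regenerating a shorter one, and the task instruction is assumed to fit within the budget.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real tokenizer would differ (assumption)."""
    return len(text.split())


def pack_context(task: str, summary: str, recalled: List[str],
                 recent: List[str], budget: int) -> List[str]:
    """Fill a token budget in priority order.

    Priority: task instruction (verbatim), newest actions (verbatim),
    recalled history, then the summary truncated to whatever room is left.
    """
    used = count_tokens(task)

    kept_recent: List[str] = []
    for action in reversed(recent):           # walk newest first
        cost = count_tokens(action)
        if used + cost > budget:
            break                             # older actions are dropped
        kept_recent.insert(0, action)         # restore chronological order
        used += cost

    kept_recalled: List[str] = []
    for item in recalled:                     # assumed already ranked by relevance
        cost = count_tokens(item)
        if used + cost <= budget:
            kept_recalled.append(item)
            used += cost

    remaining = max(budget - used, 0)
    trimmed_summary = " ".join(summary.split()[:remaining])

    sections = [task]
    if trimmed_summary:
        sections.append(trimmed_summary)
    return sections + kept_recalled + kept_recent
```

When the budget is tight, the oldest recent actions and the summary are squeezed first, while the task instruction and the newest actions survive unchanged, which matches the "err on summarizing older context" guidance.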
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:33:46.892325+00:00— report_created — created