Report #31475
[frontier] Context window overflow causing agents to lose critical system instructions mid-task
Implement hierarchical context compression: keep recent turns in a 'working memory' scratchpad and summarize the turns dropped from it using an online summarization model, distinct from the main LLM.
Journey Context:
Simple truncation drops the oldest tokens, which often include the original system prompt or few-shot examples. Sliding windows lose long-range dependencies. The hierarchical approach treats context like virtual memory: a small, high-bandwidth 'scratchpad' (the actual prompt) and a larger 'storage' tier (compressed history). When the scratchpad fills, the oldest turns are summarized by a cheaper, faster model (e.g., a 3B-parameter model, or the same model with max_tokens=150) and the summary is appended to a 'memory' section. The key insight: summarization happens online, as turns are evicted, not at the end of the task. Tradeoff: it requires managing two model calls and careful prompt engineering so that the system prompt distinguishes 'scratchpad' from 'memory'. This differs from RAG because it is dynamic compression of the current conversation, not retrieval from an external corpus. Emerging practice pairs this with 'StreamingLLM'-style attention sinks to keep the KV cache efficient.
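The eviction-and-summarize loop described above can be sketched roughly as follows. This is a minimal illustration, not a verified implementation: `HierarchicalContext`, `count_tokens`, and `toy_summarizer` are hypothetical names, the whitespace token count stands in for a real tokenizer, and the summarizer is a stub where a call to the cheaper model would go. Pinning the system prompt outside the evictable scratchpad is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting as a crude proxy for a real tokenizer.
    return len(text.split())


def toy_summarizer(text: str) -> str:
    # Hypothetical stand-in for the cheaper summarization model:
    # keeps only the first five words of the evicted text.
    return " ".join(text.split()[:5])


@dataclass
class HierarchicalContext:
    system_prompt: str                    # pinned, never evicted
    summarize: Callable[[str], str]       # cheaper/faster model call
    scratchpad_budget: int = 200          # token budget for verbatim turns
    memory: List[str] = field(default_factory=list)      # compressed history
    scratchpad: List[str] = field(default_factory=list)  # recent turns

    def _scratchpad_tokens(self) -> int:
        return sum(count_tokens(t) for t in self.scratchpad)

    def add_turn(self, turn: str) -> None:
        self.scratchpad.append(turn)
        # Online compression: evict and summarize as soon as the budget
        # is exceeded, always keeping at least the newest turn verbatim.
        while len(self.scratchpad) > 1 and self._scratchpad_tokens() > self.scratchpad_budget:
            oldest = self.scratchpad.pop(0)
            self.memory.append(self.summarize(oldest))

    def build_prompt(self) -> str:
        # Label the sections so the main LLM can tell them apart.
        parts = [self.system_prompt]
        if self.memory:
            parts.append("[MEMORY: summaries of earlier turns]\n" + "\n".join(self.memory))
        parts.append("[SCRATCHPAD: recent turns, verbatim]\n" + "\n".join(self.scratchpad))
        return "\n\n".join(parts)


ctx = HierarchicalContext(
    system_prompt="SYSTEM: always answer in JSON.",
    summarize=toy_summarizer,
    scratchpad_budget=10,
)
for i in range(5):
    ctx.add_turn(f"user turn {i} with some extra filler words here")
print(ctx.build_prompt())
```

Because the system prompt lives outside the scratchpad, overflow can only push conversation turns into the memory section; the original instructions stay at the top of every prompt. In a real setup the budget would be derived from the model's context limit, and `summarize` would wrap the second model call.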
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:13:01.975030+00:00— report_created — created