Report #39210
[frontier] Context window exhaustion in long-running coding agents with conversation history exceeding 128k tokens, causing OOM errors or loss of critical early instructions
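Implement a KV cache eviction strategy (StreamingLLM or H2O) that keeps a 'sink' of initial tokens plus the most recent tokens and evicts the cached key/value entries for tokens in the middle of the sequence, rather than relying on naive truncation or summarization. H2O additionally retains 'heavy hitter' tokens that have accumulated high attention scores.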
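As a rough illustration of the H2O-style variant, the sketch below picks which token positions to keep under a fixed cache budget: the most recent positions plus the older positions with the highest accumulated attention scores. It is a simplified one-shot selection, not the H2O reference implementation (which evicts incrementally during decoding); the function name `h2o_keep_indices` and its parameters are hypothetical, and the scores are assumed to be supplied by the serving stack.

```python
from typing import List, Sequence


def h2o_keep_indices(acc_scores: Sequence[float], budget: int, recent: int) -> List[int]:
    """Return the sorted token positions to keep in the KV cache.

    acc_scores[i] is the attention mass that position i has accumulated so far
    (assumed to be tracked elsewhere). `budget` is the total number of positions
    to keep; `recent` of those are reserved for the newest tokens.
    """
    if recent > budget:
        raise ValueError("recent window cannot exceed the total budget")

    n = len(acc_scores)
    if n <= budget:
        # Nothing to evict yet.
        return list(range(n))

    # Always keep the newest `recent` positions (they carry current task state).
    recent_idx = list(range(n - recent, n))

    # Among older positions, keep the "heavy hitters" with the largest scores.
    older = range(n - recent)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[: budget - recent]

    return sorted(heavy) + recent_idx


if __name__ == "__main__":
    scores = [9.0, 0.1, 0.2, 5.0, 0.3, 0.1, 0.4, 0.2]
    print(h2o_keep_indices(scores, budget=4, recent=2))
```

Initial tokens tend to accumulate high scores, so a policy like this usually preserves them without pinning them explicitly. An agent that must never lose its system prompt may still want to pin those positions as well.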
Journey Context:
Naive approaches truncate from the middle or summarize old turns, which can lose the system prompt or other critical early context. The pattern reported for 2025 production systems is to treat the KV cache as a managed resource. StreamingLLM (and, per this report, similar mechanisms in vLLM and llama.cpp) builds on the observation that attention sinks (the initial tokens) are crucial for generation stability, while recent tokens carry the current task state. By evicting only the middle portion of the KV cache, an agent can keep a long-horizon task coherent without the latency cost of re-computation or the lossy rewriting that summarization introduces. Evicted tokens are no longer visible to the model, so instructions that must survive should sit in the sink region or be restated in recent turns. Alternative: RAG over history, but that is typically too slow for interactive agents. This matters most for autonomous coding agents that run for hours.
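The sink-plus-recent-window policy described above can be sketched as a toy cache that stores one entry per token position. This is a minimal illustration under stated assumptions, not the StreamingLLM, vLLM, or llama.cpp implementation: the class `SinkWindowCache`, its fields, and the default sizes are hypothetical, and a real cache would hold per-layer key/value tensors and handle positional re-indexing.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List


@dataclass
class SinkWindowCache:
    """Toy KV cache that pins the first tokens and keeps a rolling recent window."""

    num_sink: int = 4   # hypothetical default: positions pinned as attention sinks
    window: int = 1024  # hypothetical default: most recent positions retained
    sinks: List[Any] = field(default_factory=list)
    recent: Deque[Any] = field(default_factory=deque)
    evicted: int = 0    # count of middle entries dropped so far

    def append(self, entry: Any) -> None:
        # The first `num_sink` entries are pinned and never evicted.
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
            return
        # Everything else enters the rolling window; the oldest entry falls out.
        self.recent.append(entry)
        if len(self.recent) > self.window:
            self.recent.popleft()
            self.evicted += 1

    def entries(self) -> List[Any]:
        # What attention would see: sink entries followed by the recent window.
        return self.sinks + list(self.recent)


if __name__ == "__main__":
    cache = SinkWindowCache(num_sink=2, window=3)
    for token_position in range(10):
        cache.append(token_position)
    print(cache.entries(), cache.evicted)
```

Memory stays bounded at `num_sink + window` entries regardless of how long the agent runs, which is the property that avoids the OOM failure mode. The trade-off is that anything in the evicted middle is gone for good.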
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:17:22.356753+00:00— report_created — created