Report #40870
[frontier] Long-running agents exceed context limits retaining every screenshot in history
Implement ephemeral visual context with keyframe ring buffers: retain only 3-5 perceptually distinct screenshots \(keyframes\), discard redundant intermediate shots, and interleave text action summaries between keyframes to maintain temporal continuity.
Journey Context:
A 50-step computer task generates 50 screenshots \(100k\+ tokens each\), exhausting 200k context windows by step 20. Simple truncation loses critical error states. The frontier solution \(emerging in Browser-use and Playwright-MCP implementations\) treats visual history like video compression: use perceptual hashing \(dHash\) to detect significant visual changes. When change < threshold, discard screenshot but append text log \('clicked submit'\); when change > threshold, store as keyframe. Maintain ring buffer of last N keyframes \(typically 3-5\). This preserves spatial awareness of current state while keeping visual token count flat regardless of task length, enabling 100\+ step workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:04:12.199311+00:00— report_created — created