Report #45391
[frontier] Context window overflow in long computer-use sessions due to screenshot accumulation
Implement visual diff compression: instead of storing raw base64 screenshots in conversation history, maintain a 'visual memory' layer that stores text descriptions of the deltas between consecutive screenshots (e.g., 'notification badge changed from 1 to 2', 'modal dialog appeared'), generated by a small vision model. Keep only the last 2-3 raw screenshots as ground truth.
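A minimal sketch of the 'visual memory' layer described above, assuming screenshots arrive as raw bytes and that a small vision model is available behind a callable. The names `VisualMemory`, `describe_delta`, and `keep_raw` are hypothetical and used only for illustration; the report does not specify an implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple


@dataclass
class VisualMemory:
    """Keeps the newest raw screenshots plus a text delta for every transition."""

    # Hypothetical stand-in for the small vision model: takes the previous
    # and current screenshot and returns a short text description of the change.
    describe_delta: Callable[[bytes, bytes], str]
    keep_raw: int = 3  # number of raw screenshots retained as ground truth
    raw: Deque[Tuple[int, bytes]] = field(default_factory=deque)
    deltas: List[Tuple[int, str]] = field(default_factory=list)
    step: int = 0

    def add(self, screenshot: bytes) -> None:
        # Describe the change relative to the previous screenshot, if any.
        if self.raw:
            previous = self.raw[-1][1]
            self.deltas.append((self.step, self.describe_delta(previous, screenshot)))
        self.raw.append((self.step, screenshot))
        self.step += 1
        # Evict the oldest raw screenshots; their text deltas are kept.
        while len(self.raw) > self.keep_raw:
            self.raw.popleft()

    def context(self) -> dict:
        # What would be placed in the conversation: text history + recent raw steps.
        return {
            "deltas": [f"step {s}: {d}" for s, d in self.deltas],
            "raw_steps": [s for s, _ in self.raw],
        }
```

Note that the very first screenshot has no predecessor, so this sketch records no text for it; once it is evicted, the baseline state is gone. Captioning the initial screen once before eviction would be one way to cover that gap.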
Journey Context:
Long-horizon computer use (30+ minutes) generates 50-100 screenshots at ~1000 tokens each, i.e. roughly 50k-100k tokens of images alone, which quickly fills a 128k context window. Dropping old images loses state (e.g., 'what did that error message say 10 steps ago?'). Summarizing images to text loses spatial detail. The diff approach leverages temporal locality: consecutive screenshots are 90%+ similar. By storing only the semantic changes (detected via image diffing or VLM captioning of deltas), we retain historical awareness without token bloat. This requires a separate 'visual memory manager' process that runs continuously, compressing history into episodic memory.
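The token arithmetic above, and the image-diffing gate it relies on, can be sketched as follows. The ~1000 tokens per screenshot and the 2-3 retained raw screenshots come from the report; the 30 tokens per text delta is an assumed figure for illustration only, and both function names are hypothetical.

```python
from typing import Sequence


def history_tokens(
    n_screens: int,
    raw_tokens: int = 1000,   # per-screenshot cost, from the report
    keep_raw: int = 3,        # raw screenshots retained, from the report
    delta_tokens: int = 30,   # ASSUMPTION: average cost of one text delta
) -> int:
    """Estimated tokens used by visual history under diff compression."""
    raw_kept = min(n_screens, keep_raw)
    compressed = n_screens - raw_kept
    return raw_kept * raw_tokens + compressed * delta_tokens


def changed_fraction(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Fraction of pixel values that differ between two same-sized frames.

    Frames are flat sequences of pixel values; a low result means the
    frames are mostly identical and a delta description may be skipped.
    """
    if len(prev) != len(curr) or not prev:
        raise ValueError("frames must be non-empty and the same size")
    differing = sum(1 for a, b in zip(prev, curr) if a != b)
    return differing / len(prev)


naive = 100 * 1000                 # 100 raw screenshots kept in context
compressed = history_tokens(100)   # 3 raw screenshots + 97 text deltas
```

Under these assumptions, 100 screenshots cost 100,000 tokens uncompressed versus 5,910 with diff compression. The real saving depends on how verbose the delta descriptions turn out to be.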
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:39:38.940826+00:00— report_created — created