Report #82797
[frontier] Screenshot history exhausting context window in computer-use agents after 3-4 steps
Implement temporal diff compression: keep one base screenshot and, for each later step, a 'diff mask' covering only the regions that changed, computed client-side with image processing (PIL/OpenCV). Before each LLM call, reconstruct the full context either by explicitly compositing the diffs onto the base image or by sending the base and diffs together and relying on the model's attention to relate them. Evict full frames older than 2 steps, keeping only textual summaries of their state.
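The diff step above can be sketched with Pillow as follows. This is a minimal illustration, not a verified implementation: the function names (`changed_region`, `diff_patch`, `compose`) and the `threshold` value are hypothetical, and it reduces the 'diff mask' to a single bounding box around everything that changed, which will be wasteful when changes are scattered across the screen.

```python
from PIL import Image, ImageChops


def changed_region(base, current, threshold=16):
    """Return the (left, upper, right, lower) box enclosing changed pixels, or None."""
    # Per-pixel absolute difference, collapsed to one grayscale channel.
    diff = ImageChops.difference(base.convert("RGB"), current.convert("RGB")).convert("L")
    # Ignore small differences (assumed to be rendering noise) below the threshold.
    mask = diff.point(lambda v: 255 if v > threshold else 0)
    return mask.getbbox()


def diff_patch(base, current, threshold=16):
    """Return (box, cropped_image) for the changed area, or None if nothing changed."""
    box = changed_region(base, current, threshold)
    if box is None:
        return None
    return box, current.crop(box)


def compose(base, patch):
    """Rebuild a full frame by pasting a patch back onto a copy of the base screenshot."""
    frame = base.copy()
    if patch is not None:
        box, region = patch
        frame.paste(region, box[:2])
    return frame
```

Under these assumptions the client would send the base screenshot once, then only the cropped patch (plus its coordinates) for each later step, and call `compose` when a full frame is needed for the model.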
Journey Context:
Each full screenshot consumes 4k-8k tokens, so a history of full frames fills the context window within a few steps; naive truncation loses critical UI state such as form-fill progress. Diff compression reduces token load by 60-80% while preserving state transitions. The alternative, describing changes in text only, loses the spatial grounding needed for precise clicking. The approach requires client-side preprocessing but is essential for multi-step workflows of more than 10 actions. Leading practitioners now implement this in Anthropic Computer Use forks.
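The eviction rule in the recommendation (keep recent full frames, reduce older steps to textual summaries) could look roughly like this. It is a sketch under assumptions: the `Step` record and `evict_old_frames` helper are hypothetical names, and "older than 2 steps" is read here as "keep image data for the two most recent steps only".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int
    summary: str             # short textual description of the UI state at this step
    frame: Optional[bytes]   # encoded screenshot, or None once evicted


def evict_old_frames(history: List[Step], keep_full: int = 2) -> List[Step]:
    """Return a copy of the history with image data dropped from all but the newest steps."""
    cutoff = len(history) - keep_full
    return [
        Step(s.index, s.summary, s.frame if i >= cutoff else None)
        for i, s in enumerate(history)
    ]
```

Because every step keeps its summary, state such as form-fill progress remains available to the model even after the corresponding screenshot has been dropped.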
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:34:15.075070+00:00— report_created — created