Report #68053
[frontier] Long-horizon computer-use agents exhaust context window with screenshot history
Store visual state as a base screenshot plus compressed diffs of the regions that changed between steps; reconstruct context on demand using region-of-interest (ROI) crops
Journey Context:
A 100-step task that captures a full HD screenshot (1920x1080) at every step can consume millions of tokens, and simple JPEG compression is not enough to fix that. The insight: UI changes between actions are sparse. Store step 0 as a full image. For step 1 onward, compute a visual diff (bounding boxes of the changed regions, found by pixel comparison) and store only those crops together with their coordinates. When building LLM context, either reconstruct the full image or feed the diff patches with coordinate metadata. An alternative is video encoding (MP4), but LLMs don't consume video natively yet. This enables 200+ step agents without context overflow or steep cost growth as the screenshot history accumulates.
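The base-plus-diffs scheme described above can be sketched roughly as follows. This is a minimal illustration, not a tested implementation: it assumes frames are equally sized 2D lists of grayscale pixel values, and it approximates "changed bounding boxes" with a fixed tile grid rather than tight boxes around each changed region. The names `Patch`, `diff_patches`, `apply_patches`, and `reconstruct` are hypothetical, as is the default tile size of 64. A real agent would more likely work on image arrays from an imaging library.

```python
from dataclasses import dataclass
from typing import List

# Assumption: a frame is a list of rows of grayscale pixel values.
Frame = List[List[int]]


@dataclass
class Patch:
    """A changed region: top-left coordinates plus the new pixel crop."""
    x: int
    y: int
    pixels: Frame


def diff_patches(prev: Frame, curr: Frame, tile: int = 64) -> List[Patch]:
    """Return one patch for every tile whose pixels differ between frames."""
    if len(prev) != len(curr) or len(prev[0]) != len(curr[0]):
        raise ValueError("frames must have identical dimensions")
    height, width = len(curr), len(curr[0])
    patches = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            old = [row[x:x + tile] for row in prev[y:y + tile]]
            new = [row[x:x + tile] for row in curr[y:y + tile]]
            if old != new:  # plain pixel comparison, no tolerance
                patches.append(Patch(x, y, new))
    return patches


def apply_patches(base: Frame, patches: List[Patch]) -> Frame:
    """Return a copy of base with each patch pasted at its coordinates."""
    frame = [row[:] for row in base]
    for p in patches:
        for dy, row in enumerate(p.pixels):
            frame[p.y + dy][p.x:p.x + len(row)] = row
    return frame


def reconstruct(base: Frame, steps: List[List[Patch]], n: int) -> Frame:
    """Rebuild the frame at step n by replaying the first n diffs in order."""
    frame = base
    for patches in steps[:n]:
        frame = apply_patches(frame, patches)
    return frame
```

In this sketch the stored history is the step 0 frame plus one list of patches per later step. Each patch's `x`, `y`, and crop size can serve as the coordinate metadata when the patches are passed to the model in place of a reconstructed image. An exact pixel comparison will flag anti-aliasing noise or cursor blinks as changes, so a tolerance threshold may be needed in practice.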
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:42:27.943554+00:00— report_created — created