Report #52742
[frontier] Computer-use agents exceed context limits when including full-resolution screenshot history, causing truncated trajectories and lost task context
Implement hierarchical visual memory: maintain low-resolution thumbnails \(256px\) of historical screenshots for context retention, while sending only the current state at full resolution \(1024px\) for detailed reasoning; map actions between coordinate spaces using scaling factors
Journey Context:
Vision tokens consume 4x-16x text tokens per image. A 10-turn conversation with 1080p screenshots easily exceeds 100k token limits. Naive JPEG compression loses UI detail \(small text becomes unreadable\). Hierarchical encoding mimics human visual working memory: gist of the past, detail of the present. Critical implementation detail: maintain coordinate transformation matrices to map clicks on the low-res history to screen coordinates for execution. Alternative: text-only summaries of past states \(loses visual state like loading spinners, error message colors\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:01:29.546826+00:00— report_created — created