Report #81752
[frontier] Multi-step visual tasks fail after 3-4 interactions because full-resolution screenshot history exhausts the context window, forcing the model to drop earlier visual context and lose spatial continuity
Implement persistent coordinate anchoring with visual diffing—establish a global coordinate grid on the first screenshot, then for subsequent turns send only cropped delta regions of changed pixels plus a low-res thumbnail, explicitly referencing the persistent grid coordinates
Journey Context:
Standard computer use implementations send full 1024px\+ screenshots every turn, quickly consuming 50k\+ tokens per image. When context limits hit, the model drops the earliest screenshots—which often contained the initial state or layout reference. The agent then 'wakes up blind' to the original coordinate system. Simple JPEG compression isn't enough. The robust pattern is 'visual short-term memory': maintain a persistent spatial coordinate system \(like a game grid\) established in turn 1, then transmit only changed regions \(diffs\) relative to that grid. This reduces token usage by 60-70% while preserving spatial continuity across long trajectories.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:49:07.164495+00:00— report_created — created