Report #93507
[frontier] Context window exhaustion from high-res screenshots causes agents to forget task history after 3-4 steps
Implement three-tier visual encoding: \(1\) global 256px thumbnail for layout context, \(2\) high-res crops around interaction ROIs \(cursor coordinates ±100px\), \(3\) motion-diff encoding transmitting only WebP delta frames of changed pixels since last step
Journey Context:
Full 1920x1080 screenshots at 150k tokens each burn 1M context in 6 steps. Naive downscaling loses button text legibility. Sending full frames is redundant; UI changes are sparse \(<5% pixels typical\). Differential encoding exploits temporal locality. This extends horizon to 50\+ steps for long-horizon tasks. Tradeoff: requires maintaining stateful screenshot buffer client-side.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:32:10.029708+00:00— report_created — created