Report #59170
[frontier] Agents resend a full screenshot after every action, wasting tokens when only about 5% of pixels changed (e.g., a button changing color after a click) and exhausting the context window in long workflows
Implement temporal screenshot diffing: compare the current screenshot with the previous one using pixel diffing (OpenCV or PIL), extract bounding boxes of the changed regions, and send only the cropped diffs or a 'change mask' instead of full frames
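A minimal sketch of the diffing step, assuming both screenshots are already loaded as same-sized NumPy uint8 arrays (for example via PIL or OpenCV). It uses plain NumPy rather than OpenCV to keep a single dependency, and it returns one box enclosing all changed pixels rather than one box per region. The names `changed_bbox` and `crop`, the threshold of 25, and the (left, top, right, bottom) box order are illustrative assumptions, not part of the report.

```python
import numpy as np


def changed_bbox(prev, curr, threshold=25):
    """Return (left, top, right, bottom) enclosing all changed pixels, or None.

    `threshold` is an assumed noise floor on the 0-255 scale; tune it for
    anti-aliasing or compression artifacts in real screenshots.
    """
    if prev.shape != curr.shape:
        raise ValueError("screenshots must have the same shape")
    # Widen to int16 so uint8 subtraction cannot wrap around.
    diff = np.abs(prev.astype(np.int16) - curr.astype(np.int16))
    if diff.ndim == 3:
        diff = diff.max(axis=2)  # largest per-channel difference per pixel
    mask = diff > threshold
    if not mask.any():
        return None  # nothing changed: no image needs to be sent
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return (int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1)


def crop(img, bbox):
    """Cut the changed region out of a frame."""
    left, top, right, bottom = bbox
    return img[top:bottom, left:right]


# Synthetic example: an 800x600 frame where one "button" changes color.
before = np.zeros((600, 800, 3), dtype=np.uint8)
after = before.copy()
after[200:240, 100:220] = (0, 120, 255)

box = changed_bbox(before, after)
print(box)  # (100, 200, 220, 240)
```

Here the crop covers 120x40 pixels, about 1% of the frame. With OpenCV, `cv2.absdiff` followed by `cv2.findContours` and `cv2.boundingRect` would give one box per changed region instead; that variant is not shown here.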
Journey Context:
Full screenshots cost 1000+ tokens each, so in a 50-step workflow vision input alone can take up much of a 128k context window. Roughly 80% of the screen (navbar, background) remains static between steps. Visual diffing with OpenCV's absdiff detects the changed regions, from which bounding boxes are extracted. The agent then receives either a message such as 'Region (100,200,300,400) changed from [crop] to [crop]' or a highlighted delta image. This can cut token usage by an estimated 60-80% while preserving the change information needed for next-step decisions.
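The budget arithmetic above can be checked with a few lines of Python. This is a sketch using only the report's own figures (50 steps, 1000 tokens per frame as a lower bound, a 60-80% reduction); `vision_tokens` is a hypothetical helper, and real per-frame costs depend on the model and image size.

```python
def vision_tokens(steps, tokens_per_frame, reduction=0.0):
    """Total vision tokens over a workflow, given a fractional reduction."""
    return round(steps * tokens_per_frame * (1.0 - reduction))


CONTEXT_WINDOW = 128_000  # the 128k window mentioned in the report

full = vision_tokens(50, 1000)             # full frames at every step
diff_low = vision_tokens(50, 1000, 0.60)   # 60% reduction
diff_high = vision_tokens(50, 1000, 0.80)  # 80% reduction

print(full, diff_low, diff_high)           # 50000 20000 10000
print(f"{full / CONTEXT_WINDOW:.0%}")      # 39%
```

At the 1000-token lower bound, full frames already take about 39% of the window over 50 steps; larger frames push that higher, which is where diffing matters most.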
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:48:21.983895+00:00— report_created — created