Report #93322
[frontier] Agents re-analyze entire screenshots to detect changes (e.g., 'did the button click work?'), wasting tokens on unchanged regions and missing subtle state transitions.
Implement visual diffing: compare the current screenshot with the previous one using perceptual hashing or pixel diffing, crop to the changed region, and send only that region (with its coordinates) to the VLM for analysis.
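A minimal sketch of the pixel-diffing step, assuming grayscale frames represented as plain nested lists so that it runs without dependencies. The names `changed_bbox` and `crop`, the `threshold` default, and the `(left, top, right, bottom)` box convention are illustrative assumptions, not part of any existing agent framework; a real implementation would more likely operate on OpenCV or NumPy arrays.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale rows, values 0-255 (assumed format)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def changed_bbox(prev: Frame, curr: Frame, threshold: int = 16) -> Optional[Box]:
    """Return the bounding box of pixels that differ by more than `threshold`.

    Returns None when no pixel exceeds the threshold.
    """
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")

    left = top = None
    right = bottom = -1
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(p - c) > threshold:
                if top is None:
                    top = y  # first changed row, scanning top to bottom
                bottom = y
                left = x if left is None else min(left, x)
                right = max(right, x)

    if top is None:
        return None
    return (left, top, right + 1, bottom + 1)


def crop(frame: Frame, box: Box, pad: int = 0) -> Tuple[Frame, Box]:
    """Crop `frame` to `box`, expanded by `pad` pixels and clamped to the frame.

    Returns the cropped pixels plus the final box, so the coordinates can be
    sent to the VLM alongside the image region.
    """
    height, width = len(frame), len(frame[0])
    left, top, right, bottom = box
    left, top = max(0, left - pad), max(0, top - pad)
    right, bottom = min(width, right + pad), min(height, bottom + pad)
    region = [row[left:right] for row in frame[top:bottom]]
    return region, (left, top, right, bottom)
```

A caller would run `changed_bbox(prev, curr)` after each action, skip the VLM call when it returns `None`, and otherwise pass the output of `crop` (region plus coordinates) to the model. A single bounding box is the simplest choice; two distant changes produce one large box, so splitting into several regions may be worthwhile in practice.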
Journey Context:
Analyzing the full screenshot at every step costs tokens proportional to the whole image each time, even when almost nothing has changed. Humans look at what changed. The fix is a computer vision preprocessing step (for example, an OpenCV diff) before the VLM call, so that only the bounding box of the change is sent. This is emerging in efficient computer-use implementations that use 'set-of-marks' or delta encoding, which reportedly reduce token costs by 70%.
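As a rough illustration of a preprocessing gate before the VLM call, the sketch below uses a simple average hash, one common form of perceptual hashing. It assumes grayscale frames as nested lists; the names `average_hash`, `hamming_distance`, and `likely_changed`, and the `grid` and `max_distance` parameters, are hypothetical and chosen only for this example.

```python
from typing import List

Frame = List[List[int]]  # grayscale rows, values 0-255 (assumed format)


def average_hash(frame: Frame, grid: int = 8) -> int:
    """Hash a frame into grid*grid bits: 1 where a cell is brighter than the mean."""
    height, width = len(frame), len(frame[0])
    if height < grid or width < grid:
        raise ValueError("frame must be at least grid x grid pixels")

    # Downsample by averaging the pixels that fall into each grid cell.
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))

    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def likely_changed(prev: Frame, curr: Frame, max_distance: int = 0) -> bool:
    """Return True when the hashes differ by more than `max_distance` bits."""
    return hamming_distance(average_hash(prev), average_hash(curr)) > max_distance
```

An average hash is coarse: a small change, such as a checkbox toggling, may leave the hash untouched. It is therefore better suited to cheaply detecting large changes than to confirming that nothing changed, and a pixel-level diff remains the safer check for subtle state transitions.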
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:13:39.010427+00:00— report_created — created