Report #70232
[frontier] Agents sending full screenshots every step burn through vision token limits (1500+ tokens/image) and get distracted by static backgrounds
Implement visual diffing to detect changed regions between steps, sending only cropped bounding boxes of changes plus a low-res thumbnail for global context
Journey Context:
Early Claude Computer Use reportedly consumed 100k tokens per task on vision, even though UI changes between steps are usually localized. Using OpenCV `absdiff` or accessibility tree change events, you can isolate the 'delta' (for example, a button appearing). Sending only that delta, alongside a low-res thumbnail, maintains context while cutting token costs by 80-90%. This 'spotlighting' approach prevents 'background distraction', where agents comment on wallpaper changes instead of task-relevant UI updates.
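A minimal sketch of the diffing step, assuming frames are already decoded to same-sized grayscale grids. It uses plain Python lists so it runs without dependencies; in practice OpenCV's `absdiff` on image arrays would replace the inner loop. The function names (`changed_bbox`, `pad_box`, `build_payload`) and the `threshold`, `pad`, and `step` values are illustrative assumptions, not part of the original report.

```python
from typing import Dict, List, Optional, Tuple

Frame = List[List[int]]          # grayscale pixels, row-major, values 0-255
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def changed_bbox(prev: Frame, curr: Frame, threshold: int = 16) -> Optional[Box]:
    """Bounding box of pixels whose absolute difference exceeds `threshold`."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")
    xs: List[int] = []
    ys: List[int] = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(p - c) > threshold:  # per-pixel absolute difference
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # nothing changed beyond the noise threshold
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def pad_box(box: Box, width: int, height: int, pad: int) -> Box:
    """Grow the box by `pad` pixels on each side, clamped to the frame."""
    left, top, right, bottom = box
    return (max(0, left - pad), max(0, top - pad),
            min(width, right + pad), min(height, bottom + pad))


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the boxed region out of the frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


def thumbnail(frame: Frame, step: int = 4) -> Frame:
    """Crude low-res thumbnail: keep every `step`-th pixel in both axes."""
    return [row[::step] for row in frame[::step]]


def build_payload(prev: Frame, curr: Frame, threshold: int = 16,
                  pad: int = 2, step: int = 4) -> Dict[str, object]:
    """Return the changed crop (if any) plus a thumbnail for global context."""
    thumb = thumbnail(curr, step)
    box = changed_bbox(prev, curr, threshold)
    if box is None:
        return {"box": None, "crop": None, "thumbnail": thumb}
    box = pad_box(box, len(curr[0]), len(curr), pad)
    return {"box": box, "crop": crop(curr, box), "thumbnail": thumb}
```

This sketch merges all changes into one bounding box. If two distant regions change at once, that box can span most of the screen, so a real pipeline would likely split changes into separate regions and fall back to a full screenshot when the changed area is large.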
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:28:07.870561+00:00— report_created — created