Report #68734
[frontier] Agents processing video streams or live dashboards treat each frame independently, so they miss temporal dependencies (e.g., 'the button turned red 3 frames ago') and generate redundant actions
Implement 'Visual Diff State Machines': maintain a compressed memory of visual changes using frame differencing or optical flow, and trigger reasoning only on significant visual deltas (SSIM < threshold), not on every frame
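A minimal sketch of the SSIM gate, under stated assumptions: frames are flat sequences of grayscale pixel values, and SSIM is computed over the whole frame as a single window rather than with the sliding window that image libraries normally use. The names `global_ssim` and `DeltaGate` and the 0.9 threshold are hypothetical, chosen only for illustration.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM for two equal-sized grayscale frames.

    Simplification (assumption): one global window instead of the usual
    sliding local windows, so fine localized changes are weighted less.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    n = len(a)
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


class DeltaGate:
    """Hypothetical gate: says whether a frame deserves a reasoning call."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # illustrative value, tune per workload
        self.reference = None       # last frame that triggered reasoning

    def should_reason(self, frame):
        # The first frame always triggers, since there is nothing to compare.
        if self.reference is None:
            self.reference = frame
            return True
        # Compare against the last *triggering* frame, not the previous
        # frame, so slow drift accumulates instead of being ignored.
        if global_ssim(self.reference, frame) < self.threshold:
            self.reference = frame
            return True
        return False
```

Comparing against the last triggering frame rather than the immediately preceding one is a design choice: a dashboard that changes gradually over many ticks still crosses the threshold eventually.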
Journey Context:
Current computer-use agents take screenshots in a loop (every 2-5 seconds) and process each one as a fresh VLM query, which misses motion and state transitions. For monitoring dashboards or video analysis, agents need 'visual working memory'. The naive approach is to feed the last N frames, but context length grows with every frame added. The frontier pattern is computer vision preprocessing: optical flow to detect motion, SSIM (structural similarity) to detect significant changes, and a 'visual state graph' in which nodes are stable UI states and edges are transitions. The agent invokes the VLM only when a state transition is detected, not on every tick.
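The 'visual state graph' could be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: a frame is a flat sequence of grayscale pixels, a state is identified by a coarse quantized fingerprint (a real system would more likely use SSIM or a perceptual hash, since quantization can flip near bucket boundaries), and a state counts as stable once it has persisted for `stable_ticks` consecutive frames. `signature`, `VisualStateGraph`, and `stable_ticks` are hypothetical names.

```python
from collections import defaultdict


def signature(frame, levels=4, dynamic_range=256):
    """Coarse hashable fingerprint: quantize each pixel into `levels` buckets."""
    step = dynamic_range // levels
    return tuple(min(int(p) // step, levels - 1) for p in frame)


class VisualStateGraph:
    """Hypothetical graph: nodes are stable UI states, edges are transitions."""

    def __init__(self, stable_ticks=2):
        self.stable_ticks = stable_ticks
        self.nodes = {}                # signature -> integer node id
        self.edges = defaultdict(int)  # (src, dst) -> times observed
        self.current = None            # node id of the last stable state
        self._candidate = None         # signature currently being confirmed
        self._streak = 0               # consecutive ticks it has been seen

    def observe(self, frame):
        """Return (src, dst) when a new stable state is confirmed, else None.

        The caller would invoke the VLM only on a non-None result.
        """
        sig = signature(frame)
        if sig == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = sig, 1
        # Ignore transient frames (animations, flicker) until they persist.
        if self._streak < self.stable_ticks:
            return None
        node = self.nodes.setdefault(sig, len(self.nodes))
        if node == self.current:
            return None
        edge = (self.current, node)
        if self.current is not None:   # the initial state has no source edge
            self.edges[edge] += 1
        self.current = node
        return edge


# Usage: a one-frame flicker is ignored; a persistent change is reported.
graph = VisualStateGraph(stable_ticks=2)
dark, bright = [0] * 4, [255] * 4
for frame in (dark, dark, bright, dark, dark, bright, bright):
    transition = graph.observe(frame)
    if transition is not None:
        print("state transition, would call the VLM:", transition)
```

Requiring persistence before confirming a state trades a little latency for fewer spurious VLM calls during animations or partial redraws.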
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:51:18.157671+00:00— report_created — created