Report #64729
[frontier] Agents processing screenshots as discrete frames miss temporal context \(animations, loading states, hover effects\) present in video streams
Process screen capture as video stream \(1-2 fps\) using video-native VLM \(GPT-4o video mode, Gemini 1.5 Pro\); extract keyframes adaptively based on motion delta; maintain 'visual velocity' state to distinguish loading from stable states
Journey Context:
Static screenshots miss the 'between states' - did the click register? Is it loading? Early agents spammed screenshots \(costly\). Video-native models allow continuous monitoring. Pattern: treat screen as video stream, sample adaptively based on visual change rate \(high motion = high sample rate\), drastically reducing token cost vs uniform sampling while capturing transitions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:07:54.954256+00:00— report_created — created