Report #25014
[frontier] Agent processing a video stream or rapid screenshots fails to detect that the page has finished loading and repeatedly captures identical frames, wasting API calls
Implement perceptual frame differencing: compare the current screenshot to the previous one using SSIM or a perceptual hash; send a frame to the LLM only when the difference exceeds a threshold, or explicitly annotate the changed regions to focus attention
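A minimal sketch of the gating idea, assuming frames have already been decoded to grayscale rows of 0-255 integers. It uses a difference hash (dHash), one kind of perceptual hash, written in pure Python so it runs without dependencies. The names `FrameGate`, `should_send`, and the default threshold of 4 bits are hypothetical and would need tuning on real screenshots; a real agent would more likely rely on an image library for hashing or SSIM.

```python
from typing import List, Optional, Sequence

Frame = Sequence[Sequence[int]]  # grayscale pixels: rows of 0-255 values


def downsample(frame: Frame, out_w: int, out_h: int) -> List[List[float]]:
    """Shrink a frame to out_w x out_h by averaging rectangular blocks."""
    h, w = len(frame), len(frame[0])
    small = []
    for oy in range(out_h):
        y0 = oy * h // out_h
        y1 = max((oy + 1) * h // out_h, y0 + 1)
        row = []
        for ox in range(out_w):
            x0 = ox * w // out_w
            x1 = max((ox + 1) * w // out_w, x0 + 1)
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def dhash(frame: Frame, size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pair of cells."""
    small = downsample(frame, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


class FrameGate:
    """Decides whether a frame differs enough to be worth an LLM call."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold  # assumed value; tune per application
        self._last_sent_hash: Optional[int] = None

    def should_send(self, frame: Frame) -> bool:
        current = dhash(frame)
        if (
            self._last_sent_hash is None
            or hamming(current, self._last_sent_hash) > self.threshold
        ):
            self._last_sent_hash = current
            return True
        return False
```

Note one design choice in this sketch: each frame is compared against the last frame that was actually sent, not the immediately preceding one, so that a slow drift across many frames eventually crosses the threshold. Comparing strictly to the previous frame, as described above, is a one-line change.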
Journey Context:
Agents operating on live streams or high-frequency automation often sample screenshots at fixed intervals (for example, every 2 seconds). This generates redundant API calls for static content and can still miss short-lived state changes that fall between samples. Perceptual differencing, unlike raw pixel differencing, filters noise (cursor blinks, clock updates) while still flagging semantic changes (a modal appearing). For loading detection, maintaining a state machine of visual elements (tracked via the accessibility tree or template matching across frames) prevents the 'premature interaction' failure, where the agent clicks before the page is ready. This is critical for reliable computer-use agents at scale, with a reported 60-70% reduction in API calls on stable pages.
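The loading state machine described above could look roughly like the following sketch. It assumes each observation has been reduced to a hashable signature, for instance a frozenset of element identifiers read from the accessibility tree, or a perceptual hash of the frame. The names `PageState`, `LoadingDetector`, and `stable_frames_required` are hypothetical, and the page is treated as ready only after the signature has stayed unchanged for a configurable number of consecutive observations.

```python
from enum import Enum
from typing import Hashable


class PageState(Enum):
    LOADING = "loading"    # signature changed on the latest observation
    SETTLING = "settling"  # unchanged, but not yet for long enough
    READY = "ready"        # unchanged for the required number of frames


class LoadingDetector:
    """Tracks whether the page has been visually stable long enough to act on."""

    def __init__(self, stable_frames_required: int = 3):
        if stable_frames_required < 1:
            raise ValueError("stable_frames_required must be at least 1")
        self.required = stable_frames_required
        self.state = PageState.LOADING
        self._has_previous = False
        self._previous: Hashable = None
        self._stable_count = 0

    def observe(self, signature: Hashable) -> PageState:
        """Feed one observation and return the updated state."""
        if self._has_previous and signature == self._previous:
            self._stable_count += 1
            if self._stable_count >= self.required:
                self.state = PageState.READY
            else:
                self.state = PageState.SETTLING
        else:
            # First observation or a visible change: restart the countdown.
            self._stable_count = 0
            self.state = PageState.LOADING
        self._previous = signature
        self._has_previous = True
        return self.state

    @property
    def can_interact(self) -> bool:
        """True only when it should be safe to click or type."""
        return self.state is PageState.READY
```

An agent loop would call `observe` on every sample and act only while `can_interact` is true. Exact equality suits element-set signatures; with perceptual hashes, a distance threshold in place of `==` would tolerate noise such as a blinking cursor.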
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:23:39.302624+00:00— report_created — created