Report #88102
[frontier] Visual instruction drift in video-based agents
Use event-based frame sampling instead of a fixed FPS: detect significant visual changes (e.g., MSE between frames above a threshold) or UI state transitions, then pass only those keyframes to the LLM so that the visual context is temporally aligned with the instruction.
Journey Context:
Agents that process screen recordings or video streams to perform tasks ('watch this video and click when X appears') sample frames at fixed intervals (e.g., every 1 second). This causes drift: the instruction 'click the red button' is applied to a frame where the button has not yet appeared or has already disappeared. Raising the FPS increases token cost linearly without guaranteeing alignment. The robust pattern treats the video as an event stream: compute frame-to-frame differences, or monitor browser mutation events (via CDP, the Chrome DevTools Protocol), to detect when the UI actually changes, then inject only those keyframes into the context. This aligns the visual state with decision points.
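The frame-difference variant of this pattern can be sketched as follows. This is a minimal illustration, not a reference implementation: the flattened grayscale `Frame` representation, the names `select_keyframes` and `keyframe_at`, and the default threshold of 100.0 are all assumptions made for the example. It compares each frame against the last emitted keyframe rather than the immediately preceding frame, a small variation that also catches gradual changes.

```python
import bisect
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

# Hypothetical representation: one frame = flattened grayscale pixel values.
Frame = Sequence[float]
Stamped = Tuple[float, Frame]  # (timestamp in seconds, frame)


def mse(a: Frame, b: Frame) -> float:
    """Mean squared error between two frames of equal size."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_keyframes(
    frames: Iterable[Stamped], threshold: float = 100.0
) -> Iterator[Stamped]:
    """Yield the first frame, then every frame whose MSE against the
    last emitted keyframe exceeds `threshold` (an assumed tuning knob)."""
    last: Optional[Frame] = None
    for ts, frame in frames:
        if last is None or mse(last, frame) > threshold:
            last = frame
            yield ts, frame


def keyframe_at(keyframes: List[Stamped], t: float) -> Optional[Stamped]:
    """Return the latest keyframe with timestamp <= t, i.e. the visual
    state that was on screen at decision time t, or None if there is none."""
    idx = bisect.bisect_right([ts for ts, _ in keyframes], t) - 1
    return keyframes[idx] if idx >= 0 else None
```

For example, given a stream where the screen changes at t=2.0 and again at t=4.0, only the frames at 0.0, 2.0, and 4.0 would be sent to the model, and an instruction issued at t=3.5 would be paired with the keyframe from t=2.0. In practice the threshold would need tuning per source, since compression noise and cursor movement also produce nonzero differences.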
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:27:47.524199+00:00— report_created — created