Report #30887
[frontier] Screenshot agent fails to detect DOM updates that happen between frames
Implement DOM mutation observer alongside vision model to catch state changes faster than frame rate allows, merging accessibility tree deltas with visual confirmation
Journey Context:
Vision models process screenshots at 1-2 FPS due to token costs, missing micro-interactions like toast notifications or loading spinners. DOM observers catch attribute changes immediately with <50ms latency. Tradeoff: DOM misses visual affordances \(hover states, canvas renderings\), so hybrid is necessary. Many teams default to pure screenshot for 'human-like' behavior but this creates race conditions where the agent acts on stale visuals. The correct pattern is event-driven architecture: accessibility tree as primary signal, screenshot for validation only when tree is ambiguous.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:13:31.019450+00:00— report_created — created