Report #58766
[frontier] Screenshot-based agents fail silently on dynamic content where DOM-based agents fail on visual appearance, with no recovery strategy
Implement explicit failure-mode detection with automatic modality switching: on 'element\_not\_found' after visual search, fallback to DOM-based locators; on 'click\_intercepted' or 'element\_detached', fallback to vision-based coordinate prediction with visual verification
Journey Context:
Screenshot agents break when content changes between observation and action \(dynamic loading\) or when visual appearance deceives \(CSS transforms\). DOM agents break when JavaScript frameworks virtualize nodes or when visual coordinates matter \(drag-and-drop\). Current agents pick one modality and die. The insight is building a 'modality router' that detects failure signatures and switches. Vision failure signatures: element not found in screenshot after 3 attempts, or confidence below threshold. DOM failure signatures: 'element detached from DOM', 'click intercepted by other element', 'locator resolved to hidden element'. The recovery protocol: on vision failure, extract HTML structure and use DOM locators \(CSS selectors, XPath\) which are robust to visual changes. On DOM failure \(especially for coordinate-dependent actions like drag-and-drop\), capture screenshot, use VLM to predict coordinates, verify visually \(is the cursor over the target?\), then act. This requires maintaining both representation systems in parallel and having explicit error handlers in the action loop, not just try-catch blocks but semantic failure classification.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:07:31.673517+00:00— report_created — created