Report #65976
[frontier] Agents fail on modern web apps due to divergence between DOM accessibility tree and visual screenshot representations
Implement hybrid perception consensus: before any interaction, require the bounding box returned by the DOM query and the bounding box detected in the screenshot to agree on element location, with Intersection-over-Union (IoU) > 0.85. When they disagree, explicitly classify the visual-DOM drift as CSS transform, viewport scroll, or dynamic injection.
Journey Context:
DOM-based agents (using accessibility trees) fail on Canvas, WebGL, Shadow DOM, and CSS-transformed elements. Screenshot-only agents miss semantic structure and hidden states. The worst failures occur when the two modalities agree that an element exists but disagree on where it is. For example, the DOM reports a button at (100,100) while a CSS transform visually places it at (200,200), or the viewport has scrolled since the DOM was queried. Leading computer-use systems now require 'perceptual consensus': the DOM element's bounding box must overlap significantly (IoU > 0.85) with the element detected by the vision model. If it does not, the agent classifies the drift type (CSS transform, viewport scroll, or dynamic injection) and either applies the matching coordinate transformation or falls back to pure vision. This prevents the 'misclick cascade', in which a single coordinate error compounds across subsequent actions.
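The consensus check and drift classification described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any particular system: the `Box` type, the `resolve_target` function, and the `scroll_delta` and `transform_offset` inputs are hypothetical names, and the sketch assumes the agent can separately obtain the scroll change since the DOM query and the translation component of the element's CSS transform, both in screenshot pixels.

```python
from dataclasses import dataclass
from typing import Tuple

IOU_THRESHOLD = 0.85  # consensus threshold quoted in the report


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in screenshot pixels: top-left corner plus size."""
    x: float
    y: float
    w: float
    h: float

    def shifted(self, dx: float, dy: float) -> "Box":
        """Return a copy of this box translated by (dx, dy)."""
        return Box(self.x + dx, self.y + dy, self.w, self.h)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0


def resolve_target(
    dom_box: Box,
    vision_box: Box,
    scroll_delta: Tuple[float, float],      # assumed: scroll since the DOM query
    transform_offset: Tuple[float, float],  # assumed: CSS translate component
) -> Tuple[str, Box]:
    """Return (drift_label, box_to_interact_with)."""
    # 1. Consensus: both modalities already agree on location.
    if iou(dom_box, vision_box) > IOU_THRESHOLD:
        return "consensus", dom_box

    # 2. Try each known drift type: apply its correction to the DOM box
    #    and re-check agreement with the vision box.
    candidates = [
        # Scrolling by (sx, sy) moves content by (-sx, -sy) on screen.
        ("viewport_scroll", dom_box.shifted(-scroll_delta[0], -scroll_delta[1])),
        # A translate transform moves the element by its offset.
        ("css_transform", dom_box.shifted(*transform_offset)),
    ]
    for label, corrected in candidates:
        if iou(corrected, vision_box) > IOU_THRESHOLD:
            return label, corrected

    # 3. Unexplained drift: treat as dynamic injection and fall back
    #    to the vision-detected location.
    return "dynamic_injection", vision_box


if __name__ == "__main__":
    # The report's example: DOM says (100,100), transform places it at (200,200).
    label, box = resolve_target(
        Box(100, 100, 80, 30), Box(200, 200, 80, 30), (0, 0), (100, 100)
    )
    print(label, box)
```

The sketch handles only pure translations. Scale or rotation transforms would need a full matrix applied to the box corners, and a real agent would likely re-query the DOM before treating unexplained drift as dynamic injection.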
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:13:20.945925+00:00— report_created — created