Report #91292
[frontier] Screenshot-to-DOM Grounding Drift in Computer-Use Agents
Implement hybrid grounding: use Playwright's accessibility tree and shadow-piercing DOM selectors as the primary grounding (treating the DOM as ground truth), with screenshot verification as secondary validation. When DOM confidence is low (shadow DOM, canvas), extract bounding boxes via CDP and verify them with a vision model, but keep state references as stable DOM node IDs rather than pixel coordinates.
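The routing rule above can be sketched as a small pure function. This is a minimal illustration, not a reference implementation: `DomCandidate`, `dom_confidence`, `choose_grounding`, the penalty weights, and the 0.75 threshold are all hypothetical names and values chosen for the example, and the `node_id` field stands in for whatever stable node reference the agent keeps (for instance a CDP backend node ID).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Grounding(Enum):
    DOM = "dom"                          # act on the DOM reference directly
    DOM_PLUS_VISION = "dom_plus_vision"  # resolve a box, then confirm it with a vision model


@dataclass(frozen=True)
class DomCandidate:
    """Hypothetical record for one candidate target; all field names are assumptions."""
    node_id: int                  # stable node reference, never a pixel coordinate
    role: Optional[str]           # accessibility role, if the tree exposes one
    name: Optional[str]           # accessible name, if any
    in_shadow_root: bool = False
    is_canvas: bool = False


def dom_confidence(c: DomCandidate) -> float:
    """Illustrative heuristic score in [0, 1]; the weights are made up for this sketch."""
    if c.is_canvas:
        return 0.0  # widgets drawn inside a canvas have no DOM nodes of their own
    score = 1.0
    if c.in_shadow_root:
        score -= 0.4
    if not c.role:
        score -= 0.3
    if not c.name:
        score -= 0.3
    return max(score, 0.0)


def choose_grounding(c: DomCandidate, threshold: float = 0.75) -> Grounding:
    """DOM stays primary; vision is added only when DOM confidence falls below the threshold."""
    if dom_confidence(c) >= threshold:
        return Grounding.DOM
    return Grounding.DOM_PLUS_VISION
```

In both branches the agent would keep `node_id` as its handle on the target and re-resolve the bounding box whenever it needs pixels, so a resize or theme change does not invalidate stored state.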
Journey Context:
Pure screenshot agents hallucinate 'phantom buttons' based on training data biases and fail on theme changes. Pure DOM agents break on Web Components (Shadow DOM) and Canvas apps (Figma, Excalidraw). The common mistake is choosing one modality. The frontier approach treats the browser as a multi-layer environment: DOM selectors provide stable references across resolutions, while vision provides semantic validation (is this actually a button or just a div styled like one?). This requires bidirectional mapping: DOM node -> bounding box (via getBoundingClientRect) -> screenshot crop region. The tradeoff is implementation complexity vs. robustness to dynamic web apps.
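The bounding box -> screenshot crop step of that mapping, and its inverse, can be sketched as follows. This assumes a viewport-only screenshot captured at the device scale, so that one CSS pixel covers `device_pixel_ratio` screenshot pixels; a full-page screenshot would also need scroll offsets. `CssRect`, `CropRegion`, `rect_to_crop`, and `pixel_to_css` are hypothetical names introduced for the example; the rectangle values would come from `getBoundingClientRect`, which reports viewport-relative CSS pixels.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CssRect:
    """Viewport-relative rectangle in CSS pixels, as getBoundingClientRect reports it."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class CropRegion:
    """Screenshot-pixel box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def rect_to_crop(
    rect: CssRect,
    device_pixel_ratio: float,
    shot_width: int,
    shot_height: int,
    padding_css: float = 0.0,
) -> Optional[CropRegion]:
    """Map a CSS-pixel rect to a crop box, or None if it lies outside the screenshot."""
    # Scale to screenshot pixels, rounding outward so the crop never clips the element.
    left = math.floor((rect.x - padding_css) * device_pixel_ratio)
    top = math.floor((rect.y - padding_css) * device_pixel_ratio)
    right = math.ceil((rect.x + rect.width + padding_css) * device_pixel_ratio)
    bottom = math.ceil((rect.y + rect.height + padding_css) * device_pixel_ratio)
    # Clamp to the screenshot bounds; partially visible elements yield a partial crop.
    left = max(0, min(left, shot_width))
    right = max(0, min(right, shot_width))
    top = max(0, min(top, shot_height))
    bottom = max(0, min(bottom, shot_height))
    if right <= left or bottom <= top:
        return None
    return CropRegion(left, top, right, bottom)


def pixel_to_css(px: float, py: float, device_pixel_ratio: float) -> Tuple[float, float]:
    """Inverse direction: a screenshot pixel back to a viewport CSS point."""
    return (px / device_pixel_ratio, py / device_pixel_ratio)
```

The `(left, top, right, bottom)` order matches the box tuple that Pillow's `Image.crop` accepts, so the region can be cut out and handed to the vision model. The inverse function covers the other direction: a point the vision model flags in the screenshot can be converted to CSS coordinates and then resolved back to a DOM node.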
Lifecycle
2026-06-22T11:49:34.828401+00:00— report_created — created