Report #69981
[frontier] Screenshot-based agents treat the viewport as a flat 2D canvas, so they miss modal dialogs, nested iframes, and shadow DOM boundaries. This leads to 'click interception' errors and infinite scroll loops.
Implement 'Viewport Tree Traversal': maintain a hierarchical tree of viewports (main frame → modals → iframes → shadow roots) with z-index awareness, and require explicit 'enter/exit' context switches before interaction.
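A minimal sketch of that idea in Python, as a pure in-memory model that calls no browser API. Every name here (`Viewport`, `ViewportCursor`, `ContextError`, the `blocking` flag) is hypothetical and chosen for illustration, not taken from any library. It assumes a blocking overlay intercepts every click made from outside it, and that the overlay with the highest z-index wins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Viewport:
    """One node in the viewport tree (hypothetical model, not a library type)."""
    name: str
    kind: str                  # "frame", "modal", "iframe" or "shadow_root"
    z_index: int = 0
    blocking: bool = False     # True while this node intercepts clicks outside it
    parent: Optional["Viewport"] = field(default=None, repr=False)
    children: List["Viewport"] = field(default_factory=list, repr=False)

    def add(self, child: "Viewport") -> "Viewport":
        """Attach a child viewport and return it."""
        child.parent = self
        self.children.append(child)
        return child


class ContextError(RuntimeError):
    """Raised when an action is attempted from the wrong viewport context."""


class ViewportCursor:
    """Tracks the agent's current viewport; it moves only via enter()/exit()."""

    def __init__(self, root: Viewport) -> None:
        self.current = root

    def enter(self, name: str) -> Viewport:
        """Step into a direct child of the current viewport."""
        for child in self.current.children:
            if child.name == name:
                self.current = child
                return child
        raise ContextError(f"{name!r} is not a direct child of {self.current.name!r}")

    def exit(self) -> Viewport:
        """Step back out to the parent viewport."""
        if self.current.parent is None:
            raise ContextError("already at the root viewport")
        self.current = self.current.parent
        return self.current

    def path(self) -> str:
        """Readable path from the root to the current viewport."""
        names, node = [], self.current
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " > ".join(reversed(names))

    def blocking_overlay(self) -> Optional[Viewport]:
        """Topmost blocking viewport that lies outside the current context."""
        blockers, node, came_from = [], self.current, None
        while node is not None:
            # Skip the branch we climbed out of: we are already inside it.
            blockers += [c for c in node.children if c.blocking and c is not came_from]
            came_from, node = node, node.parent
        return max(blockers, key=lambda v: v.z_index, default=None)

    def click(self, target: str) -> str:
        """Refuse the click if an overlay outside the current context would intercept it."""
        overlay = self.blocking_overlay()
        if overlay is not None:
            raise ContextError(f"click on {target!r} would be intercepted by {overlay.name!r}")
        return f"{self.path()} :: click {target}"
```

In use, an agent would build the tree from whatever the page exposes, call `enter("cookie-banner")` before clicking inside that modal, and `exit()` afterwards. A `click` attempted from the main frame or from a sibling iframe while the modal is still blocking raises `ContextError` instead of silently hitting the overlay.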
Journey Context:
Standard computer use APIs initially provided only full-screen screenshots. Agents would attempt to click buttons obscured by modal overlays, or scroll indefinitely inside nested containers. The viewport tree pattern, derived from the WebDriver BiDi specification and implemented in Playwright's 'contexts' and Stagehand's 'act' API, explicitly models containment. It is critical for SaaS apps with nested shadow DOM (Salesforce, Figma). Alternative: flattened accessibility trees, which lose modal context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:56:56.057222+00:00— report_created — created