Report #41252
[frontier] Agent attempts to click button that exists in accessibility tree but is visually hidden behind modal overlay
Implement 'hybrid perception' with conflict resolution: query both the accessibility tree (DOM) and a screenshot-based vision pass. When they disagree (the element is present in the A11y tree but OCR-based vision does not detect it on screen), default to vision for action grounding and flag the step for human review.
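A minimal sketch of the conflict-resolution rule, in plain Python. The names `Decision`, `Resolution`, `seen_by_vision`, and `resolve` are hypothetical, and the OCR step is reduced to a list of detected strings. The handling of the vision-only case (target on screen but missing from the A11y tree) is an assumption; the report only specifies the A11y-only case.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACT = "act"              # safe to ground the action
    DO_NOT_ACT = "skip"      # vision cannot see the target, so do not click
    NOT_FOUND = "not_found"  # neither source reports the target


@dataclass(frozen=True)
class Resolution:
    decision: Decision
    needs_human_review: bool
    reason: str


def seen_by_vision(label: str, ocr_texts: list[str]) -> bool:
    """Treat the target as visible if OCR found its label (case-insensitive)."""
    wanted = label.strip().lower()
    return any(wanted == text.strip().lower() for text in ocr_texts)


def resolve(in_a11y_tree: bool, visible: bool) -> Resolution:
    """When the two sources disagree, vision wins."""
    if in_a11y_tree and visible:
        return Resolution(Decision.ACT, False, "A11y tree and vision agree")
    if in_a11y_tree and not visible:
        # The case from this report: present in the tree, hidden on screen.
        return Resolution(Decision.DO_NOT_ACT, True,
                          "in A11y tree but not detected on screen")
    if visible:
        # Assumption: vision-only targets (e.g. canvas apps) are actionable.
        return Resolution(Decision.ACT, False, "vision only, no DOM metadata")
    return Resolution(Decision.NOT_FOUND, False, "target absent in both sources")
```

For the modal case in this report, the button label is missing from the OCR output, so `resolve(True, seen_by_vision("Submit", ["Cookie settings", "Accept"]))` blocks the click and sets `needs_human_review`.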
Journey Context:
DOM-based agents (the Playwright default) fail on canvas apps (Figma, Miro), where the 'button' is just a drawn rectangle. Screenshot agents fail on lazy-loaded content that is in the DOM but not yet rendered. A common mistake is assuming the A11y tree is ground truth; it is often stale or abstracted (React portals). An alternative is to use the browser's CDP to force a layout calculation, but that is slow. Hybrid perception treats vision as primary for action verification and the DOM as metadata for semantic labeling. Tradeoff: either 2x LLM calls per step or one complex multimodal prompt.
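One way to use the DOM as metadata for a vision-first agent is to match each vision detection to the DOM node whose bounding box overlaps it most. The sketch below uses intersection-over-union (IoU) for the match. IoU is a swapped-in technique, not something the report prescribes; `DomNode`, `label_detection`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # x, y, width, height in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = overlap_w * overlap_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


@dataclass(frozen=True)
class DomNode:
    role: str
    name: str
    box: Box


def label_detection(detection: Box, dom_nodes: list[DomNode],
                    threshold: float = 0.5) -> Optional[DomNode]:
    """Return the best-overlapping DOM node, or None if no node overlaps enough.

    None means the detection stays actionable but unlabeled (e.g. a canvas app).
    """
    best = max(dom_nodes, key=lambda n: iou(detection, n.box), default=None)
    if best is None or iou(detection, best.box) < threshold:
        return None
    return best
```

The action is still grounded on the vision box; the matched node only supplies the role and accessible name for the prompt or log.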
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:42:57.186308+00:00— report_created — created