Report #52584
[frontier] Pure vision agents miss semantic structure; pure DOM agents miss visual state (disabled buttons)
Implement 'Bifocal Perception' by maintaining two parallel streams: (1) the accessibility tree/DOM for semantic planning (what elements exist) and (2) a screenshot for visual verification (is the element actually clickable?). Cross-reference the two by mapping each CSS selector to its on-screen coordinates. Use the DOM for action planning and vision for pre-action state validation.
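A minimal Python sketch of this two-stream flow, under stated assumptions: `DomNode`, `plan_from_dom`, `bifocal_click`, and the injected `looks_clickable` and `click` callables are hypothetical names introduced for illustration, not part of any browser-automation library. The bounding box is assumed to already be in screenshot pixel coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height (screenshot pixels)


@dataclass(frozen=True)
class DomNode:
    """Hypothetical record for one element taken from the DOM / accessibility tree."""
    selector: str  # CSS selector that identifies the element
    role: str      # accessibility role, e.g. "button"
    name: str      # accessible name, e.g. "Submit"
    box: Box       # where the element is drawn in the screenshot


def center(box: Box) -> Tuple[float, float]:
    """Selector-to-coordinate mapping: reduce a bounding box to a click point."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def plan_from_dom(nodes: List[DomNode], role: str, name: str) -> Optional[DomNode]:
    """Stream 1: pick the target from structure alone (no vision involved)."""
    for node in nodes:
        if node.role == role and node.name.lower() == name.lower():
            return node
    return None


def bifocal_click(
    nodes: List[DomNode],
    role: str,
    name: str,
    looks_clickable: Callable[[Box], bool],  # stream 2: visual check on the region
    click: Callable[[float, float], None],   # whatever performs the actual click
) -> bool:
    """Plan with the DOM, validate with vision, then act. Returns True if clicked."""
    target = plan_from_dom(nodes, role, name)
    if target is None:
        return False  # element does not exist semantically
    if not looks_clickable(target.box):
        return False  # exists in the DOM but fails visual validation
    click(*center(target.box))
    return True
```

In a real agent, `looks_clickable` would wrap whatever vision check is available (a model call or a pixel heuristic on the cropped region) and `click` would wrap the automation driver; both are injected here so the control flow can be exercised without a browser.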
Journey Context:
Early Computer Use agents (Anthropic) used pure screenshots with predicted coordinates, which led to clicks on non-interactive divs that looked like buttons. Playwright-style agents used DOM selectors but missed visually disabled states that are not reflected in HTML attributes (e.g., a button that looks grayed out but has no 'disabled' attribute). The fusion approach uses OmniParser-style extraction (icon detection, text detection) aligned with the accessibility tree. The key insight: vision is high-latency but high-fidelity, which suits verification; the DOM is low-latency, which suits structure. Don't use vision to find buttons (slow, prone to hallucination); use it to confirm that the button found via the DOM is actually clickable (an anti-hallucination guard). This prevents 'phantom clicks' on elements that exist in the DOM but are visually inactive.
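As one illustration of the pre-action guard, the sketch below checks the DOM flags first and then applies a crude pixel heuristic to the cropped button region. Everything here is an assumption made for illustration: the function names, the idea of approximating "grayed out" as low saturation plus low contrast, and both threshold values are hypothetical and would need tuning per site. A production system might use a vision model for this step instead.

```python
import colorsys
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # 8-bit RGB


def mean_saturation(pixels: List[Pixel]) -> float:
    """Average HSV saturation in [0, 1]; grayed-out controls tend to score low."""
    if not pixels:
        return 0.0
    total = 0.0
    for r, g, b in pixels:
        _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        total += s
    return total / len(pixels)


def luminance_range(pixels: List[Pixel]) -> float:
    """Spread between darkest and brightest pixel in [0, 1], a rough contrast proxy."""
    if not pixels:
        return 0.0
    lums = [0.2126 * r + 0.7152 * g + 0.0722 * b for r, g, b in pixels]
    return (max(lums) - min(lums)) / 255


def looks_disabled(
    pixels: Iterable[Pixel],
    min_saturation: float = 0.15,  # hypothetical threshold
    min_contrast: float = 0.35,    # hypothetical threshold
) -> bool:
    """Heuristic: washed-out colour AND faint label suggests a visually disabled control."""
    px = list(pixels)
    return mean_saturation(px) < min_saturation and luminance_range(px) < min_contrast


def safe_to_click(attrs: Dict[str, str], pixels: Iterable[Pixel]) -> bool:
    """Guard: reject if the DOM says disabled, or if the crop looks disabled."""
    if "disabled" in attrs or attrs.get("aria-disabled") == "true":
        return False  # cheap structural check first
    return not looks_disabled(pixels)  # visual check catches what the DOM missed
```

The DOM check runs first because it is cheap; the visual check only has to catch the case described above, where the markup claims the element is enabled but the rendering says otherwise. A high-contrast monochrome button (black text on white) passes this heuristic because its contrast is high even though its saturation is zero.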
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:45:26.150154+00:00 - report_created - created