Report #96942
[frontier] Vision models miss interactive elements such as custom dropdowns or canvas-based buttons because these do not look like standard semantic HTML controls
Pre-process screenshots with accessibility-tree overlays: render bounding boxes with element types (button, link, textbox) and labels directly onto the image before feeding it to the VLM, or provide the structured a11y tree as text context alongside the image.
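A minimal sketch of both variants, assuming the accessibility nodes have already been extracted into simple records with a role, a label, and a pixel-space bounding box. The `A11yNode` type, the `INTERACTIVE_ROLES` set, and the function names are hypothetical, chosen for illustration; the overlay function assumes Pillow is installed, while the text-context path uses only the standard library.

```python
from dataclasses import dataclass

# Assumption: which roles count as interactive; extend for your platform.
INTERACTIVE_ROLES = {"button", "link", "textbox", "combobox", "checkbox", "menuitem"}


@dataclass
class A11yNode:
    """Hypothetical flattened accessibility node."""
    role: str
    label: str
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in screenshot pixels


def interactive_nodes(nodes):
    """Keep nodes that have an interactive role and a box with positive area."""
    return [
        n for n in nodes
        if n.role in INTERACTIVE_ROLES
        and n.bbox[2] > n.bbox[0]
        and n.bbox[3] > n.bbox[1]
    ]


def to_text_context(nodes):
    """Serialize one numbered line per element, to send alongside the image."""
    lines = []
    for i, n in enumerate(interactive_nodes(nodes), start=1):
        left, top, right, bottom = n.bbox
        lines.append(f'[{i}] {n.role} "{n.label}" bbox=({left},{top},{right},{bottom})')
    return "\n".join(lines)


def draw_overlay(image, nodes):
    """Return a copy of a PIL image with numbered, role-tagged boxes drawn on it."""
    # Imported lazily so the text-only path works without Pillow.
    from PIL import ImageDraw

    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for i, n in enumerate(interactive_nodes(nodes), start=1):
        draw.rectangle(n.bbox, outline="red", width=2)
        draw.text((n.bbox[0] + 3, n.bbox[1] + 3), f"{i}:{n.role}", fill="red")
    return annotated


if __name__ == "__main__":
    sample = [
        A11yNode("combobox", "Country", (40, 120, 260, 152)),
        A11yNode("button", "Submit", (40, 180, 140, 212)),
        A11yNode("image", "Logo", (0, 0, 32, 32)),  # dropped: not interactive
    ]
    print(to_text_context(sample))
```

Both functions number the elements in the same order, so a mark drawn on the image and a line in the text context refer to the same element. Filtering to interactive roles before serializing is one way to limit the extra tokens this adds.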
Journey Context:
Raw pixels lose semantics: ARIA labels and interactive roles are invisible to a pure-vision model. DOM-based agents see structure but not render state. Hybrid approaches (OmniParser, SeeClick) extract structured regions from the screen. Explicit labeling bridges the gap, since VLMs understand UI patterns better when elements are labeled. Tradeoff: this increases token count and requires OS-level accessibility API access (macOS AXUIElement, Windows UI Automation).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T21:17:57.857641+00:00 - report_created - created