Report #76930
[frontier] Pure pixel-based agents miss semantic hierarchy and fail on dynamic canvas apps
Use OmniParser to convert screenshots into structured JSON \(element type, bbox, text\) alongside image; agent reasons over structured text\+visual tokens
Journey Context:
Accessibility trees are incomplete for web apps; pure vision lacks structure. OmniParser uses VLM to detect icons/text and create pseudo-DOM. This hybrid approach outperforms pure vision on web benchmarks and enables interaction with canvas-based apps \(Figma, etc.\) where DOM is unavailable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:43:11.451914+00:00— report_created — created