Report #98156
[frontier] Vision-only GUI agents fail on text-heavy interfaces while DOM agents fail on canvas, maps, and WebGL apps
Build a hybrid perceiver: give the model the accessibility tree or DOM for text and structure, plus a screenshot for global layout and pixel-only state, and let it choose which signal to trust for each action. When using screenshots, overlay set-of-mark labels (numbered boxes) on the detected interactable regions so the model can refer to targets by ID.
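A minimal sketch of the merge step, under stated assumptions: accessibility nodes arrive as dicts with hypothetical `box`, `role`, and `name` keys, and a separate detector (not shown) returns pixel boxes for interactable regions. The names `Mark`, `build_marks`, and `marks_to_prompt` are illustrative and do not come from A11y-CUA or OmniParser. Structural nodes are numbered first; a vision-only box gets its own mark only when it does not overlap an existing one, which is how canvas or map controls absent from the tree would surface.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (left, top, right, bottom) in screenshot pixels
Box = Tuple[int, int, int, int]


@dataclass
class Mark:
    mark_id: int
    box: Box
    source: str                 # "a11y" or "vision"
    role: Optional[str] = None  # only known for a11y-backed marks
    text: Optional[str] = None


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_marks(a11y_nodes: List[Dict], vision_boxes: List[Box],
                iou_threshold: float = 0.5) -> List[Mark]:
    """Number a11y nodes first, then add vision boxes that match no mark."""
    marks: List[Mark] = []
    for node in a11y_nodes:
        marks.append(Mark(len(marks) + 1, node["box"], "a11y",
                          node.get("role"), node.get("name")))
    for box in vision_boxes:
        # Skip detections that duplicate a region we already labeled.
        if all(iou(box, m.box) < iou_threshold for m in marks):
            marks.append(Mark(len(marks) + 1, box, "vision"))
    return marks


def marks_to_prompt(marks: List[Mark]) -> str:
    """Text legend to send alongside the marked-up screenshot."""
    lines = []
    for m in marks:
        if m.source == "a11y":
            label = f'{m.role or "element"} "{m.text or ""}"'
        else:
            label = "unlabeled visual region"
        lines.append(f"[{m.mark_id}] {label} at {m.box} (source: {m.source})")
    return "\n".join(lines)


# Example: one button known to the tree, one canvas region found only by vision.
example_marks = build_marks(
    a11y_nodes=[{"box": (10, 10, 110, 40), "role": "button", "name": "Save"}],
    vision_boxes=[(12, 11, 108, 41), (200, 200, 600, 500)],
)
example_prompt = marks_to_prompt(example_marks)
```

Drawing the numbered boxes onto the screenshot is left out here; any image library could do it using the same `mark_id` and `box` values, so the legend and the overlay stay in sync. The 0.5 overlap threshold is an arbitrary starting point, not a tuned value.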
Journey Context:
DOM agents miss visual state rendered on canvas, video, or maps; screenshot agents misread text and confuse decorative icons with buttons. A11y-CUA and OmniParser both show that combining structural context with pixel grounding is the only robust path. This is why production computer-use loops now expose both signals.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:19:36.095485+00:00 - report_created - created