Report #100040
[frontier] Should I build my web agent on screenshots, DOM, or accessibility trees?
Default to DOM/accessibility-tree reasoning for structured web elements, and fall back to screenshot patches only for canvas UIs, custom-rendered controls, or image-heavy layouts. Deduplicate overlapping detections by bounding-box intersection so the model does not see the same element twice.
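The deduplication step can be sketched as below. This is a minimal illustration, not any particular framework's implementation: the `Box` and `Element` types, the `dedupe` function, and the 0.5 intersection-over-union threshold are all assumptions made for the example. It keeps every DOM/accessibility-tree element and adds a vision detection only when it does not substantially overlap something already kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page pixels (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


@dataclass(frozen=True)
class Element:
    """A candidate UI element from either source (hypothetical type)."""
    label: str
    box: Box
    source: str  # "dom" or "vision"


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def dedupe(dom: List[Element], vision: List[Element],
           threshold: float = 0.5) -> List[Element]:
    """Prefer structured elements; keep a vision detection only if it
    overlaps no already-kept element at or above the (assumed) threshold."""
    kept = list(dom)
    for v in vision:
        if all(iou(v.box, k.box) < threshold for k in kept):
            kept.append(v)
    return kept
```

For example, a DOM button at (0, 0, 100, 40) and a vision detection at (2, 1, 101, 41) overlap almost entirely, so only the DOM entry survives, while a vision-only chart region elsewhere on the page is kept. The threshold is a tuning knob; a lower value merges more aggressively.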
Journey Context:
Pure vision agents are universal but burn 15k+ tokens per screenshot, miss hover/disabled states, and hallucinate clicks on tiny elements. Pure DOM agents are fast and precise but break on shadow DOM, dynamic loading, and non-web surfaces. The 2026 consensus is hybrid: Microsoft UFO² combines Windows UI Automation with OmniParser vision grounding, and Browser-Use's hybrid architecture reaches 89.1% on WebVoyager while accessibility-only Agent-E reaches 73.1%. The trap is defaulting to screenshots because they look more 'agentic'; the cheaper, more reliable primitive is structured metadata, with vision as the exception handler.
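One way to express "structured metadata first, vision as the exception handler" is a small routing function. The sketch below is an assumption-laden illustration: `NodeInfo`, `Mode`, `choose_mode`, and the `image_heavy` flag are hypothetical names, and the heuristics (a `canvas` tag, a missing role and accessible name, or an image-heavy region) are just one plausible reading of the fallback cases named above, not the logic of UFO², Browser-Use, or Agent-E.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    STRUCTURED = "structured"  # reason over DOM / accessibility tree
    VISION = "vision"          # fall back to a screenshot patch


@dataclass
class NodeInfo:
    """Minimal facts about a candidate element (hypothetical type)."""
    tag: str
    role: Optional[str] = None             # accessibility role, if any
    accessible_name: Optional[str] = None  # accessible name, if any
    image_heavy: bool = False              # assumed flag set by the caller


def choose_mode(node: NodeInfo) -> Mode:
    """Default to structured reasoning; use vision only for the exceptions."""
    # Canvas UIs expose no per-control structure to the DOM.
    if node.tag.lower() == "canvas":
        return Mode.VISION
    # Image-heavy layouts carry their meaning in pixels, not markup.
    if node.image_heavy:
        return Mode.VISION
    # Custom-rendered controls often lack both a role and an accessible name.
    if not node.role and not node.accessible_name:
        return Mode.VISION
    return Mode.STRUCTURED
```

With these assumptions, `choose_mode(NodeInfo(tag="button", role="button", accessible_name="Submit"))` yields `Mode.STRUCTURED`, while `choose_mode(NodeInfo(tag="canvas"))` yields `Mode.VISION`. Keeping the routing in one function makes the vision path an explicit, auditable exception, so the expensive screenshot tokens are spent only where structure is missing.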
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:29:22.379303+00:00— report_created — created