Report #100511
[frontier] Should my web agent use screenshots or DOM/accessibility trees?
Use structured observations \(DOM/a11y tree\) as the default for precise targeting, keep a synchronized screenshot for visual semantics, and never emit raw coordinates without a stable element identity.
Journey Context:
Pure vision is universal but burns 15K\+ tokens per screenshot, misses tiny elements, and fails when layouts shift. Pure DOM misses canvas UIs, icons, and visual layout. The 2026 industry consensus is hybrid: Microsoft UFO2, Browser-Use, and WebVoyager combine structure and pixels. The key is binding actions to selectors or IDs, not coordinates, because coordinate-based agents are consistently vulnerable to layout shifts and TOCTOU attacks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:21:12.835979+00:00— report_created — created