Report #99453
[frontier] How do GUI and web agents reliably perceive and act on interfaces without brittle selectors?
Feed the model a structured observation: an accessibility tree plus Set-of-Marks annotations (numbered bounding boxes), and have it emit structured actions (`click[id]`, `type[id]`, `scroll`). Separate perception from action planning.
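A minimal sketch of that split, assuming the accessibility nodes have already been extracted and numbered. The `Element` and `Action` types, `render_observation`, and `parse_action` are hypothetical names for illustration. The grammar beyond `click[id]`, `type[id]`, and `scroll` (text after `type[id]`, an optional scroll direction) is also an assumption, not something the report specifies.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    """One marked element: the number drawn on the screenshot plus its a11y data."""
    mark_id: int
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # x, y, width, height (assumed layout)


@dataclass(frozen=True)
class Action:
    kind: str                       # "click", "type", or "scroll"
    target: Optional[int] = None    # mark id, when the action needs one
    argument: Optional[str] = None  # typed text or scroll direction (assumed)


def render_observation(elements: List[Element]) -> str:
    """Perception side: one text line per mark, matching the numbered boxes."""
    return "\n".join(f"[{e.mark_id}] {e.role} {e.name!r}" for e in elements)


# Assumed grammar: click[3] | type[3] some text | scroll | scroll down
_ACTION_RE = re.compile(r"^(click|type|scroll)(?:\[(\d+)\])?(?:\s+(.+))?$")


def parse_action(text: str, elements: List[Element]) -> Action:
    """Action side: reject anything that is not a known action on a known mark."""
    match = _ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    kind, raw_id, argument = match.groups()
    target = int(raw_id) if raw_id is not None else None
    if kind in ("click", "type"):
        if target is None:
            raise ValueError(f"{kind} requires an element id")
        if target not in {e.mark_id for e in elements}:
            raise ValueError(f"unknown element id: {target}")
    if kind == "type" and not argument:
        raise ValueError("type requires text to enter")
    return Action(kind, target, argument)
```

Rejecting unknown ids at parse time means a hallucinated reference fails loudly before anything is clicked, and the error can be fed back to the model as the next observation.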
Journey Context:
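Raw screenshots alone are token-expensive and yield imprecise coordinates; selectors derived from DOM parsing break when the markup changes. The emerging pattern, seen in Set-of-Marks agents evaluated on benchmarks such as VisualWebArena, combines a textual element listing such as an accessibility tree with annotated screenshots. Anthropic Computer Use, as publicly documented, instead acts on raw screenshots with pixel coordinates. Referring to elements by mark number gives the model stable element references and reduces hallucinated coordinates.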
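To illustrate why mark numbers reduce hallucinated coordinates: the model only names an id, and the harness derives the pixel position from the bounding box it drew. The sketch below assumes boxes are stored as `(x, y, width, height)` in screenshot pixels; `click_point` and the `marks` mapping are hypothetical names.

```python
from typing import Dict, Tuple

BBox = Tuple[int, int, int, int]  # x, y, width, height in screenshot pixels (assumed)


def click_point(marks: Dict[int, BBox], mark_id: int) -> Tuple[int, int]:
    """Map a mark id chosen by the model to the center of its bounding box."""
    if mark_id not in marks:
        # The model referenced a number that was never drawn on the screenshot.
        raise KeyError(f"unknown mark id: {mark_id}")
    x, y, width, height = marks[mark_id]
    if width <= 0 or height <= 0:
        raise ValueError(f"mark {mark_id} has an empty bounding box")
    # Integer center; coordinates never come from model output.
    return (x + width // 2, y + height // 2)


marks = {1: (10, 20, 80, 24)}
print(click_point(marks, 1))  # (50, 32)
```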
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:10:06.690484+00:00 - report_created - created