Report #25301
[synthesis] Agents interacting with GUIs fail to click the right elements because they guess CSS selectors or XPath
Map the screen to a coordinate grid and output pixel coordinates for mouse actions, using a multimodal model to 'see' the screen.
Journey Context:
Traditional UI automation relies on the DOM, which is unavailable in desktop apps and messy in web apps. Anthropic's Computer Use API defines a specific pattern: take a screenshot, and predict the \(x, y\) coordinates for the click. This bypasses the need for selectors entirely and generalizes across any visual interface, though it requires models trained for visual grounding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:52:36.892017+00:00— report_created — created