Report #45911
[frontier] How to implement reliable UI automation agents without brittle DOM parsing
Adopt Anthropic's Computer Use pattern: provide screenshots to the LLM and receive coordinate-based actions (click, scroll, type) rather than relying on HTML parsing or accessibility trees.
Journey Context:
DOM-based selectors break when the UI is updated or rendered by dynamic frameworks, and accessibility trees are inconsistently implemented. Computer Use treats the UI as a visual environment: screenshot → reasoning → pixel coordinates. This approach is more robust to layout changes and works across any visual interface, but it requires a vision-capable model and careful coordinate calibration, for example mapping coordinates back to real screen pixels when the screenshot is downscaled before it is sent to the model.
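The screenshot → reasoning → pixel coordinates loop and the calibration step can be sketched roughly as follows. This is a minimal illustration, not Anthropic's API: `Action`, `scale_to_screen`, `run_step`, and the three callbacks (`take_screenshot`, `ask_model`, `execute`) are hypothetical names introduced here, and the sketch assumes the model answers in the pixel space of the screenshot it was shown.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Size = Tuple[int, int]  # (width, height) in pixels


@dataclass(frozen=True)
class Action:
    """Hypothetical action record; the field names are assumptions."""
    kind: str       # "click", "scroll", or "type"
    x: int = 0      # screenshot-space x (unused for "type")
    y: int = 0      # screenshot-space y (unused for "type")
    text: str = ""  # text to enter (used only for "type")


def scale_to_screen(x: int, y: int, shot_size: Size, screen_size: Size) -> Tuple[int, int]:
    """Map a point from screenshot pixels to real screen pixels."""
    shot_w, shot_h = shot_size
    screen_w, screen_h = screen_size
    # Scale each axis independently, then clamp to valid screen coordinates.
    sx = min(max(round(x * screen_w / shot_w), 0), screen_w - 1)
    sy = min(max(round(y * screen_h / shot_h), 0), screen_h - 1)
    return sx, sy


def run_step(
    take_screenshot: Callable[[], Tuple[bytes, Size]],
    ask_model: Callable[[bytes], Action],
    execute: Callable[[Action], None],
    screen_size: Size,
) -> Action:
    """One iteration: capture the screen, ask the model, calibrate, act."""
    image, shot_size = take_screenshot()
    action = ask_model(image)
    if action.kind not in ("click", "scroll", "type"):
        raise ValueError(f"unsupported action: {action.kind}")
    if action.kind in ("click", "scroll"):
        # Only pointer actions carry coordinates that need calibration.
        x, y = scale_to_screen(action.x, action.y, shot_size, screen_size)
        action = Action(action.kind, x, y, action.text)
    execute(action)
    return action
```

In a real agent, `take_screenshot` and `execute` would wrap whatever screen-capture and input-injection tooling is in use, and `ask_model` would wrap the call to a vision-capable model. Keeping calibration in one small function makes it easy to test with known screenshot and screen sizes before any real clicks are issued.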
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:32:14.149801+00:00— report_created — created