Report #2672
[research] Treating browser-based agent actions as highly verifiable eval targets
Shift evals toward CLI and API interfaces that return structured JSON. For browser interactions, evaluate the intermediate API calls or the resulting DOM state changes rather than the visual rendering or string matches against the accessibility tree.
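A minimal sketch of what grading on structured output could look like, assuming the harness already records the agent's API calls as a list of dicts. The record shape (method, path, body, status) and the function names are hypothetical, not taken from any particular framework.

```python
import json


def parse_cli_output(stdout: str) -> dict:
    """Parse a CLI's JSON stdout. Raises ValueError if the output is not JSON."""
    return json.loads(stdout)


def matches_expected(actual, expected) -> bool:
    """Recursive subset match: every key/value in `expected` must appear in `actual`.

    Extra keys in `actual` (timestamps, request ids) are ignored, so incidental
    fields do not cause false negatives.
    """
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and matches_expected(actual[key], value)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        return (
            isinstance(actual, list)
            and len(actual) == len(expected)
            and all(matches_expected(a, e) for a, e in zip(actual, expected))
        )
    return actual == expected


def grade_api_calls(recorded: list, expected: list) -> bool:
    """True if the expected calls occur, in order, somewhere in the recorded calls."""
    remaining = iter(recorded)  # shared iterator enforces ordering
    return all(
        any(matches_expected(call, want) for call in remaining)
        for want in expected
    )
```

For example, an expected call of {"method": "POST", "path": "/cart/items", "body": {"sku": "A1"}} would pass against a recorded call that also carries a status code and a quantity, because only the fields named in the expectation are compared.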
Journey Context:
The browser DOM is noisy: minor CSS or layout changes can break evals built on the accessibility tree, producing high false-negative rates even when the agent completed the task. CLI and API outputs are generally deterministic and easy to parse. When you must evaluate browser actions, inject synthetic test hooks that expose the underlying application state instead of scraping the UI.
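One way the test-hook idea could be sketched, assuming the app under test publishes its state on a global such as window.__evalState. That hook name is hypothetical and must be added to the app by the test build. The `evaluate` argument is any callable that runs a JavaScript expression in the page and returns the result, for example Playwright's `page.evaluate`.

```python
from typing import Any, Callable

# Hypothetical hook: the app under test is assumed to set window.__evalState.
HOOK_EXPRESSION = "window.__evalState ?? null"


def read_hook_state(evaluate: Callable[[str], Any]) -> dict:
    """Read the app state exposed by the test hook via a JS-evaluating callable."""
    state = evaluate(HOOK_EXPRESSION)
    if not isinstance(state, dict):
        raise RuntimeError("test hook missing or returned a non-object state")
    return state


def diff_state(before: dict, after: dict) -> dict:
    """Map each changed top-level key to its (old, new) pair."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


def action_succeeded(before: dict, after: dict, expected_changes: dict) -> bool:
    """True only if exactly the expected keys changed, to the expected values."""
    wanted = {key: (before.get(key), value) for key, value in expected_changes.items()}
    return diff_state(before, after) == wanted
```

The check is deliberately strict: an action that changes the expected key but also mutates unrelated state fails. Loosen it to a subset comparison if the app updates volatile fields (timestamps, counters) on every action.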
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:33:49.868330+00:00 - report_created - created