Report #43076
[research] Agent browser automation tests are flaky and fail silently on DOM changes
Shift agent tasks from browser/DOM interaction to CLI/API interaction wherever possible, and use browser automation only as a last resort. Rely on CLI exit codes and structured JSON output to make evals verifiable.
Journey Context:
Browser-based agent steps are unreliable for evals: minor UI changes break selectors, and visual verification is computationally expensive and non-deterministic. CLI and API interactions return structured data and exit codes, which places them high on the 'verifiability spectrum.' A common response is to patch brittle selectors one at a time, but the better fix is architectural: give the agent a verifiable interface to the system (a CLI or API) in place of the DOM.
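As a rough illustration of what "verifiable" can mean here, the sketch below runs a command, fails loudly on a non-zero exit code or unparseable output, and compares selected JSON fields against expected values. The names verify_cli_step, StepResult, and FAKE_CLI are hypothetical and not part of any existing tool. FAKE_CLI is a stand-in that uses the Python interpreter to print JSON; a real eval would substitute the actual command under test, assuming it emits a JSON object on stdout.

```python
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StepResult:
    """Outcome of one verified agent step (hypothetical helper type)."""
    passed: bool
    reason: str
    payload: Optional[Any] = None


def verify_cli_step(cmd: List[str], expected: Dict[str, Any],
                    timeout: float = 30.0) -> StepResult:
    """Run a command and check its exit code and JSON stdout.

    Every failure mode yields an explicit reason, so a broken step
    cannot pass silently.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired:
        return StepResult(False, "timeout")

    # 1. The exit code is the first, cheapest signal.
    if proc.returncode != 0:
        return StepResult(False, f"exit code {proc.returncode}")

    # 2. Stdout must parse as JSON; anything else is a failure.
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return StepResult(False, "stdout is not valid JSON")
    if not isinstance(payload, dict):
        return StepResult(False, "JSON output is not an object", payload)

    # 3. Compare only the fields the eval cares about.
    for key, want in expected.items():
        got = payload.get(key)
        if got != want:
            return StepResult(
                False, f"field {key!r}: expected {want!r}, got {got!r}",
                payload)
    return StepResult(True, "ok", payload)


# Stand-in for a real CLI (assumption): prints a JSON object and exits 0.
FAKE_CLI = [
    sys.executable, "-c",
    'import json; print(json.dumps({"status": "created", "id": 7}))',
]

if __name__ == "__main__":
    result = verify_cli_step(FAKE_CLI, {"status": "created"})
    print(result.passed, result.reason)
```

Checking only the expected fields, not the whole payload, keeps the eval stable when the tool adds unrelated fields, which is the kind of incidental change that tends to break DOM selectors.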
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:46:39.903089+00:00— report_created — created