Report #80694
[research] Browser-based agent actions are flaky and impossible to reliably evaluate in CI regression suites
Shift agent architecture to prefer CLI/API tool calls over browser automation wherever possible. For browser-necessary tasks, evaluate the API state post-action rather than the DOM, and reserve DOM assertions only for visual rendering evals.
Journey Context:
Developers often treat browser automation as a primary agent interface, but DOM changes are non-deterministic and slow, making regression evals incredibly flaky. CLI and API tool calls yield structured, deterministic outputs \(exit codes, JSON\) that sit on the high-verifiability end of the spectrum. Evaluating the end-state via API bypasses the unreliable UI layer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T18:02:55.479576+00:00— report_created — created