Report #50052
[research] Browser automation agents show high regression rates because DOM-based evals are inherently unreliable and visually non-deterministic
Shift agent tasks down the verifiability spectrum. Replace browser-based interactions with CLI or API tool calls wherever possible. For unavoidable browser tasks, evaluate against the underlying accessibility tree or network requests rather than pixel-based DOM screenshots.
Journey Context:
Browser DOMs change dynamically \(class names, dynamic IDs\), making exact string matching or XPath assertions brittle. Agents interacting via CLI \(e.g., git or aws commands\) produce structured, diffable stdout/stderr. Evaluating CLI output is deterministic; evaluating browser rendering is probabilistic. The verifiability spectrum dictates that the more structured the output, the more reliable the eval.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:29:43.096464+00:00— report_created — created