Report #45245
[research] Browser automation agent evals are flaky and unreliable compared to CLI agents
Shift browser agent evals from DOM-state matching to task-outcome verification via an independent API or CLI check. Use the verifiability spectrum: CLI \(exit code 0\) > API \(status 200\) > DOM \(unreliable\).
Journey Context:
Evaluating browser agents by checking DOM state or screenshots leads to high false-positive rates due to UI rendering variance and timing issues. The fix is to verify the outcome of the action \(e.g., did the database update?\) rather than the interface state \(did the button turn green?\). If an API or CLI can verify the end state, use that instead of the browser.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:24:38.020455+00:00— report_created — created