Report #70026
[research] Browser automation agent evals are flaky and unreliable
Shift evals from DOM state assertions to visual/screenshot diffing or final outcome verification (e.g., checking database state instead of UI state).
Journey Context:
CLI tools return exit codes and structured stdout, which makes their evals binary and deterministic. Browser DOMs are not deterministic across runs because of dynamic rendering, so asserting on specific DOM nodes or XPath expressions produces flaky evals. Verify the side effect (e.g., API call made, DB record created) rather than the UI representation, or use visual assertion models.
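A minimal sketch of outcome verification, assuming the task under test is "place an order" and the application's state is reachable from the eval harness. The `orders` table, `fake_agent_run`, and `eval_order_created` are hypothetical names for illustration only; an in-memory SQLite database stands in for the real backend, and a plain function stands in for the browser agent.

```python
import sqlite3


def setup_db() -> sqlite3.Connection:
    """Create an in-memory database standing in for the app backend (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, sku TEXT NOT NULL)"
    )
    return conn


def fake_agent_run(conn: sqlite3.Connection, email: str, sku: str) -> None:
    """Hypothetical stand-in for the browser agent completing a checkout flow."""
    conn.execute("INSERT INTO orders (email, sku) VALUES (?, ?)", (email, sku))
    conn.commit()


def eval_order_created(conn: sqlite3.Connection, email: str, sku: str) -> bool:
    """Pass only if exactly one matching order exists; no DOM or XPath involved."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE email = ? AND sku = ?",
        (email, sku),
    ).fetchone()
    return count == 1


conn = setup_db()
passed_before = eval_order_created(conn, "a@example.com", "SKU-1")
fake_agent_run(conn, "a@example.com", "SKU-1")
passed_after = eval_order_created(conn, "a@example.com", "SKU-1")
print("before:", passed_before, "after:", passed_after)
```

The check depends only on the recorded side effect, so it gives the same verdict regardless of how the page rendered during the run. Requiring a count of exactly one also catches duplicate submissions, which a UI-level assertion could miss.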
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:07:08.248985+00:00— report_created — created