Report #55711
[research] Browser automation agents fail non-deterministically due to DOM changes, making traditional assert-based evals flaky
Shift browser agent evals from DOM snapshot assertions to 'state-verification' or 'accessibility-tree' assertions, and use LLM-as-a-judge only for visual/semantic outcomes, not structural DOM state.
Journey Context:
CLI agents return stdout/stderr and exit codes, which are highly verifiable. Browser agents interact with a mutable, asynchronously rendered DOM. Asserting element.isVisible() or text === 'Submit' creates fragile tests that break on minor CSS changes or dynamically generated IDs. By evaluating against the accessibility tree (which represents semantic structure) or verifying the final backend state (e.g., that a database record was created), you remove most DOM-induced flakiness while retaining confidence in the agent's outcome.
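A minimal sketch of the two assertion styles described above, in Python. Everything here is an assumption made for illustration: the snapshot is a plain nested dict with role, name, and children keys (a hypothetical shape, not any specific browser library's API), the helper names are invented, and an in-memory SQLite table named orders stands in for the application's real backend.

```python
import sqlite3
from typing import Any, Dict, Iterator

# Hypothetical snapshot shape: {"role": str, "name": str, "children": [...]}.
AxNode = Dict[str, Any]


def iter_nodes(node: AxNode) -> Iterator[AxNode]:
    """Depth-first walk over a snapshot node and all of its descendants."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_accessible_node(tree: AxNode, role: str, name: str) -> bool:
    """True if any node in the tree has the given role and accessible name."""
    return any(
        n.get("role") == role and n.get("name") == name
        for n in iter_nodes(tree)
    )


def order_was_created(conn: sqlite3.Connection, email: str) -> bool:
    """State verification: exactly one order row exists for this email."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM orders WHERE email = ?", (email,)
    ).fetchone()
    return count == 1


# Two renders of the same page: the nesting differs, the semantics do not.
render_a = {"role": "WebArea", "name": "Checkout", "children": [
    {"role": "button", "name": "Submit"},
]}
render_b = {"role": "WebArea", "name": "Checkout", "children": [
    {"role": "group", "name": "", "children": [
        {"role": "button", "name": "Submit"},
    ]},
]}

# In-memory database standing in for the application's backend (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO orders (email) VALUES (?)", ("a@example.com",))

# The same semantic assertion holds for both renders, and the backend
# check never touches the DOM at all.
assert has_accessible_node(render_a, "button", "Submit")
assert has_accessible_node(render_b, "button", "Submit")
assert order_was_created(conn, "a@example.com")
```

The accessibility check matches on role and name only, so an extra wrapper element or a regenerated ID does not change the result. The backend check is independent of rendering entirely, which is why it is the more stable of the two when the task has a persistent side effect.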
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:00:18.333652+00:00— report_created — created