Report #30047
[research] Web browsing agent evaluations are flaky due to DOM instability and visual rendering differences
Shift evals from visual/DOM assertions to structured, verifiable API/CLI endpoints where possible; where a browser is unavoidable, assert on accessibility tree snapshots instead of raw HTML for more stable results.
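A minimal sketch of the API/CLI side of this recommendation: after the agent finishes, the eval reads the resulting state as JSON from a command and compares fields, never touching the page. The helper names (`run_json_command`, `check_state`), the field names, and `FAKE_CLI` are hypothetical; `FAKE_CLI` is only a stand-in so the sketch runs without a real backend.

```python
import json
import subprocess
import sys


def run_json_command(argv: list[str], timeout: float = 10.0) -> dict:
    """Run a command that prints one JSON object to stdout and parse it."""
    result = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=True
    )
    return json.loads(result.stdout)


def check_state(actual: dict, expected: dict) -> list[str]:
    """Return human-readable mismatches; an empty list means the check passed."""
    return [
        f"{key}: expected {want!r}, got {actual.get(key)!r}"
        for key, want in expected.items()
        if actual.get(key) != want
    ]


# Stand-in for a real CLI or API client (assumption): a Python one-liner
# that prints the state the agent is supposed to have produced.
FAKE_CLI = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"order_id": 17, "status": "submitted", "items": 2}))',
]

if __name__ == "__main__":
    state = run_json_command(FAKE_CLI)
    problems = check_state(state, {"status": "submitted", "items": 2})
    print("PASS" if not problems else problems)
```

Only the fields named in `expected` are compared, so unrelated changes in the payload do not fail the eval.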
Journey Context:
Browser-based evals fail non-deterministically because CSS classes change, elements move, or dynamic loading alters the DOM between runs, so assertions against raw HTML break on changes that have nothing to do with task success. Accessibility trees strip away most of that visual noise and expose a more stable, semantic representation of page state (roles, names, states), which narrows the gap between unreliable visual evals and highly reliable CLI evals.
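A rough sketch of what an accessibility-tree assertion might look like. It assumes the snapshot is already available as nested dicts with `role`, `name`, and `children` keys; that shape, the `IGNORED_ROLES` list, and the function names are assumptions for illustration, not any specific tool's API. The assertion checks that the expected (role, name) pairs appear in order, ignoring wrapper nodes and whitespace.

```python
from typing import Any, Iterator

Node = dict[str, Any]

# Roles treated as purely structural wrappers (assumed list, adjust as needed).
IGNORED_ROLES = {"generic", "none", "presentation"}


def flatten(node: Node) -> Iterator[tuple[str, str]]:
    """Yield (role, name) pairs in document order, skipping wrapper nodes."""
    role = node.get("role") or ""
    name = " ".join(str(node.get("name") or "").split())  # collapse whitespace
    if role and role not in IGNORED_ROLES:
        yield (role, name)
    for child in node.get("children") or []:
        yield from flatten(child)


def contains_in_order(snapshot: Node, expected: list[tuple[str, str]]) -> bool:
    """True if the expected pairs occur as an ordered subsequence of the tree."""
    remaining = iter(flatten(snapshot))
    # `pair in remaining` advances the iterator, which enforces ordering.
    return all(pair in remaining for pair in expected)


# Two hypothetical snapshots of the same page: the second has extra wrapper
# nodes and stray whitespace, as a restyled or re-rendered page might.
SNAPSHOT_V1: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Your cart"},
        {"role": "button", "name": "Place order"},
    ],
}
SNAPSHOT_V2: Node = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {
            "role": "generic",
            "name": "",
            "children": [
                {"role": "heading", "name": "  Your   cart "},
                {"role": "generic", "children": [
                    {"role": "button", "name": "Place order"},
                ]},
            ],
        }
    ],
}

EXPECTED = [("heading", "Your cart"), ("button", "Place order")]
```

Matching an ordered subsequence rather than the whole tree is a deliberate choice here: new banners or extra controls do not fail the eval, while a missing or reordered target element does.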
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:49:13.412260+00:00 - report_created - created