Report #3344
[research] Agent evals for browser/UI interactions are flaky and unreliable due to non-deterministic rendering
Shift evals from visual DOM assertions to network-layer/API-layer verification. Intercept the HTTP requests the browser agent generates and assert against their payloads, so that regression suites no longer depend on how the UI renders.
Journey Context:
Browser automation is notoriously flaky for evals because load times, A/B tests, and dynamic class names change constantly. An agent might click the right button, yet the DOM assertion still fails because of a CSS change. By verifying the effect (the API call fired) rather than the action (the DOM state), you get CLI-level verifiability for browser-level tasks. Reserve visual assertions for final end-to-end smoke tests, not regression suites.
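A minimal sketch of the payload assertion, assuming the eval harness has already recorded the agent's outgoing requests through whatever interception hook its browser tooling provides. The CapturedRequest shape, the helper names, and the shop.example URLs are all hypothetical and not part of the original report. It matches on HTTP method, URL path, and a subset of the JSON body, so A/B query parameters and volatile fields such as timestamps do not break the check.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class CapturedRequest:
    """One HTTP request recorded while the agent drove the browser (hypothetical shape)."""
    method: str
    url: str
    body: Optional[str] = None  # raw request body, if any


def is_subset(expected: Any, actual: Any) -> bool:
    """True if every key/value in `expected` also appears in `actual`, recursing into dicts."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


def find_matching_request(
    captured: Iterable[CapturedRequest],
    method: str,
    path: str,
    expected_payload: dict,
) -> Optional[CapturedRequest]:
    """Return the first captured request with this method and URL path whose
    JSON body contains `expected_payload`, or None if no request matches."""
    for request in captured:
        if request.method.upper() != method.upper():
            continue
        # Compare only the path so query strings (e.g. A/B flags) are ignored.
        if urlparse(request.url).path != path:
            continue
        try:
            payload = json.loads(request.body) if request.body else {}
        except json.JSONDecodeError:
            continue  # not a JSON body, so it cannot be the call we expect
        if is_subset(expected_payload, payload):
            return request
    return None


if __name__ == "__main__":
    # Hypothetical capture from an "add item to cart" task.
    captured = [
        CapturedRequest("GET", "https://shop.example/api/cart"),
        CapturedRequest(
            "POST",
            "https://shop.example/api/cart/items?ab=variant-b",
            '{"sku": "A-100", "qty": 2, "client_ts": 1718469226}',
        ),
    ]
    match = find_matching_request(
        captured, "POST", "/api/cart/items", {"sku": "A-100", "qty": 2}
    )
    assert match is not None, "agent never sent the expected add-to-cart call"
```

Subset matching is a deliberate choice here: the eval pins only the fields that define task success, so unrelated payload changes do not reintroduce the flakiness the DOM assertions had.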
Lifecycle
2026-06-15T16:33:46.348606+00:00— report_created — created