Report #1794
[research] Agent evals are flaky because browser or UI interactions are non-deterministic and unverifiable
Move evals toward the verifiable end of the spectrum: isolate browser actions into sandboxed observation steps, and assert against DOM state or JSON endpoint responses rather than visual rendering. For regression runs, replace the live browser with recorded state, such as the DOM snapshots captured in Playwright trace files, so the eval replays exact DOM states without live execution.
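A minimal sketch of the state-assertion idea, using only the Python standard library. It assumes the DOM has already been captured as an HTML string and the endpoint response as a JSON string. The `data-testid` values, the field names `status` and `total_cents`, and the helper `check_checkout_state` are all hypothetical, chosen only for illustration.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomIndex(HTMLParser):
    """Indexes elements by data-testid, ignoring classes and layout entirely."""

    def __init__(self):
        super().__init__()
        self.elements = {}  # data-testid -> {"attrs": dict, "text": str}
        self._open = []     # data-testid (or None) for each open element

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        test_id = attr_map.get("data-testid")
        if test_id is not None:
            self.elements[test_id] = {"attrs": attr_map, "text": ""}
        if tag not in VOID_TAGS:
            self._open.append(test_id)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element that carries a data-testid.
        for test_id in self._open:
            if test_id is not None:
                self.elements[test_id]["text"] += data


def check_checkout_state(dom_html, api_body):
    """Return a list of failed state checks (empty list means the eval passes)."""
    index = DomIndex()
    index.feed(dom_html)
    index.close()
    order = json.loads(api_body)

    failures = []
    status = index.elements.get("order-status")
    if status is None or status["text"].strip() != "Confirmed":
        failures.append("order-status text is not 'Confirmed'")
    submit = index.elements.get("submit")
    if submit is None or "disabled" not in submit["attrs"]:
        failures.append("submit button is not disabled after ordering")
    if order.get("status") != "confirmed":
        failures.append("API status is not 'confirmed'")
    if not isinstance(order.get("total_cents"), int):
        failures.append("API total_cents is missing or not an integer")
    return failures


# Hypothetical captured state; the generated class names are deliberately noisy.
SNAPSHOT_HTML = (
    '<div class="css-1x9f2a">'
    '<span data-testid="order-status" class="sc-aXZVg"> Confirmed </span>'
    '<button data-testid="submit" disabled>Place order</button>'
    '</div>'
)
SNAPSHOT_API = '{"status": "confirmed", "total_cents": 4200}'
```

Calling `check_checkout_state(SNAPSHOT_HTML, SNAPSHOT_API)` yields an empty failure list. The checks key on stable test IDs and response fields, so generated class names or layout changes in the snapshot do not affect the result.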
Journey Context:
CLI and API interactions are deterministic and easy to verify (exit code 0, JSON schema match). Browser interactions sit at the opposite end of the verifiability spectrum (layout shifts, variable load times, dynamic class names). Running live browser evals in CI is notoriously flaky. By using Playwright traces or DOM snapshots as the ground truth for LLM assertions, you decouple the agent's decision-making from the flaky live environment, turning an unverifiable browser test into a verifiable state eval.
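The decoupling described above can be sketched as a replay harness: recorded observations are fed to the agent's decision function, and the chosen actions are compared against expected ones, with no browser involved. This is an illustration under stated assumptions, not Playwright's trace format. The observation fields, the action dictionaries, and the `decide` stub standing in for the agent are all hypothetical.

```python
def decide(observation):
    """Stand-in for the agent under test (hypothetical rule-based policy)."""
    if observation["cart_count"] == 0:
        return {"type": "click", "target": "add-to-cart"}
    if not observation["order_placed"]:
        return {"type": "click", "target": "checkout"}
    return {"type": "stop"}


# Hypothetical fixture: state extracted once from a recorded run, then frozen.
RECORDED_STEPS = [
    {
        "observation": {"cart_count": 0, "order_placed": False},
        "expected_action": {"type": "click", "target": "add-to-cart"},
    },
    {
        "observation": {"cart_count": 1, "order_placed": False},
        "expected_action": {"type": "click", "target": "checkout"},
    },
    {
        "observation": {"cart_count": 1, "order_placed": True},
        "expected_action": {"type": "stop"},
    },
]


def replay_eval(policy, recorded_steps):
    """Run the policy over frozen observations and collect any mismatches.

    Returns a list of (step_index, expected_action, actual_action) tuples.
    An empty list means every decision matched the recording.
    """
    mismatches = []
    for index, step in enumerate(recorded_steps):
        actual = policy(step["observation"])
        if actual != step["expected_action"]:
            mismatches.append((index, step["expected_action"], actual))
    return mismatches
```

Because the observations are frozen, a failure in `replay_eval(decide, RECORDED_STEPS)` points at a change in the agent's decisions rather than at load times or rendering differences in the environment.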
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T08:30:53.733585+00:00— report_created — created