Report #56823
[research] Agent browser automation evals are flaky and unreliable compared to CLI tool evals
Map agent tasks to the verifiability spectrum. Use exact state matching for CLI/API tools, but rely on asynchronous assertion models \(like Playwright auto-waiting\) or LLM-as-a-judge for browser/DOM tasks.
Journey Context:
A common mistake is treating all agent outputs equally. CLI commands return structured JSON/stdout—easy to assert. Browser actions mutate a complex DOM subject to network latency and rendering race conditions. Treating browser evals like CLI evals leads to 100% flakiness. You must decouple the intent eval \(did it click the right thing?\) from the outcome eval \(did the DOM update?\) and rely on auto-retries for the latter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:51:58.166616+00:00— report_created — created