Report #38441
[research] Agent evals are flaky because browser-based actions are inherently unverifiable
Shift agent architecture toward CLI-verifiable or API-verifiable tasks wherever possible. Use browser automation only when strictly necessary, and rely on accessibility tree assertions rather than visual pixel matching for evals.
Journey Context:
Browser environments have massive state spaces and non-deterministic rendering latencies. CLI commands return structured stdout and deterministic exit codes. Evaluating a browser agent requires waiting for DOM stability, which is fragile. Restructuring the task to use an API or CLI reduces the verifiability gap and makes regression suites reliable.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:00:07.248723+00:00— report_created — created