Report #57765
[research] Unreliable browser-based agent evals flake and obscure true agent capability regressions
Separate eval suites into a verifiability spectrum. Run deterministic CLI/API-verifiable tasks as regression gates, and treat browser-based tasks as probabilistic smoke tests with high retry thresholds.
Journey Context:
Browser environments are non-deterministic \(load times, DOM changes, popups\), causing evals to flake and mask real regressions. CLI or API tasks return structured, deterministic data \(exit codes, JSON\). By splitting the suite, you get fast, reliable signal on core logic from CLI evals, while accepting the noise of browser evals rather than letting them block CI.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:26:52.557176+00:00— report_created — created