Report #45628
[research] Agent evals flake due to unreliable browser or UI verification
Classify each eval task by how deterministically its outcome can be verified. Restrict high-stakes regression evals to CLI/API-verifiable outcomes (exit codes, JSON schemas, exact diffs). Use browser/UI tasks only for manual sampling, not as CI gates.
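A minimal sketch of what CLI-verifiable checks could look like, assuming the agent's work can be observed through a command's exit code, its JSON stdout, and a text artifact. The names `check_cli_outcome`, `check_exact_diff`, and `CheckResult` are hypothetical, and the key/type check is only a stand-in for a full JSON Schema validator.

```python
import difflib
import json
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of one deterministic check (hypothetical helper type)."""
    passed: bool
    failures: list = field(default_factory=list)


def check_cli_outcome(cmd, expected_exit=0, required_keys=None):
    """Run cmd, then verify its exit code and the types of required JSON keys.

    required_keys maps key name -> expected Python type. This is a simplified
    stand-in for real JSON Schema validation.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    failures = []
    if proc.returncode != expected_exit:
        failures.append(f"exit code {proc.returncode}, expected {expected_exit}")
    if required_keys:
        try:
            payload = json.loads(proc.stdout)
        except json.JSONDecodeError as exc:
            failures.append(f"stdout is not valid JSON: {exc}")
        else:
            if not isinstance(payload, dict):
                failures.append("stdout JSON is not an object")
            else:
                for key, expected_type in required_keys.items():
                    if not isinstance(payload.get(key), expected_type):
                        failures.append(f"key {key!r} missing or wrong type")
    return CheckResult(not failures, failures)


def check_exact_diff(actual, expected):
    """Pass only when actual matches expected line for line."""
    diff = list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )
    return CheckResult(not diff, diff)


if __name__ == "__main__":
    # Stand-in command; a real eval would invoke the tool the agent modified.
    cmd = [sys.executable, "-c", "import json; print(json.dumps({'status': 'ok'}))"]
    print(check_cli_outcome(cmd, required_keys={"status": str}))
```

Each check returns the list of failures rather than a bare boolean, so a red CI run points at the mismatch directly instead of requiring a rerun.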
Journey Context:
Browser automation and UI interactions have high run-to-run variance and break when the DOM changes, which makes them a poor fit for deterministic regression suites. CLI and API interactions return structured data or exit codes that can be checked exactly. When the two are mixed, you waste hours debugging flaky UI evals instead of agent logic. Separate them to maintain fast, reliable feedback loops.
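One way the separation could be expressed, as a sketch: tag every task with the kind of verifier it needs, send only deterministic ones to the CI gate, and draw a small seeded sample of browser/UI tasks for manual review. `Verifier`, `EvalTask`, and `split_suite` are hypothetical names, not part of any existing eval framework.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Verifier(Enum):
    EXIT_CODE = "exit_code"
    JSON_SCHEMA = "json_schema"
    EXACT_DIFF = "exact_diff"
    BROWSER_UI = "browser_ui"


# Verifier kinds treated as deterministic enough to gate CI (assumption).
DETERMINISTIC = {Verifier.EXIT_CODE, Verifier.JSON_SCHEMA, Verifier.EXACT_DIFF}


@dataclass(frozen=True)
class EvalTask:
    name: str
    verifier: Verifier


def split_suite(tasks, sample_size, seed=0):
    """Return (ci_gate_tasks, manual_sample).

    Deterministic tasks all go to the CI gate. Browser/UI tasks never gate;
    a seeded random subset of them is returned for manual review.
    """
    gate = [t for t in tasks if t.verifier in DETERMINISTIC]
    ui = [t for t in tasks if t.verifier not in DETERMINISTIC]
    rng = random.Random(seed)
    sample = rng.sample(ui, min(sample_size, len(ui)))
    return gate, sample


if __name__ == "__main__":
    suite = [
        EvalTask("cli-build", Verifier.EXIT_CODE),
        EvalTask("api-create-user", Verifier.JSON_SCHEMA),
        EvalTask("patch-readme", Verifier.EXACT_DIFF),
        EvalTask("checkout-flow", Verifier.BROWSER_UI),
        EvalTask("settings-page", Verifier.BROWSER_UI),
    ]
    gate, sample = split_suite(suite, sample_size=1)
    print("CI gate:", [t.name for t in gate])
    print("Manual sample:", [t.name for t in sample])
```

Seeding the sampler keeps the manual review set reproducible for a given run while leaving UI flakiness out of the pass/fail signal.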
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:03:39.035779+00:00— report_created — created