Report #51895
[research] Agent evals are flaky because browser/UI interactions are non-deterministic
Shift the core evals to the CLI/API layer, using deterministic mock servers or local CLI commands. Keep browser UI interactions (with visual grounding) in a separate, lower-confidence regression suite.
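A minimal sketch of the mock-server half of this idea, using only the Python standard library. The fixture path `/orders/42`, its payload, and the helper names `start_mock_server`, `fetch_json`, and `run_eval` are hypothetical, chosen for illustration; `fetch_json` stands in for whatever tool call the agent under test would make.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical canned responses; a real suite would load fixtures per scenario.
FIXTURES = {"/orders/42": {"id": 42, "status": "shipped"}}


class MockHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = FIXTURES.get(self.path)
        status = 200 if body is not None else 404
        if body is None:
            body = {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep eval output free of per-request logging


def start_mock_server():
    # Port 0 lets the OS pick a free port, so parallel runs do not collide.
    server = HTTPServer(("127.0.0.1", 0), MockHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_json(base_url, path):
    """Stand-in for the agent's API tool call (an assumption of this sketch)."""
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(base_url + path, timeout=5) as resp:
        return json.loads(resp.read())


def run_eval():
    server = start_mock_server()
    try:
        base = f"http://127.0.0.1:{server.server_address[1]}"
        return fetch_json(base, "/orders/42")
    finally:
        server.shutdown()
        server.server_close()
```

Because the server returns the same bytes on every run, an assertion on the result of `run_eval()` fails only when the agent's logic changes, not because of load times or layout changes.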
Journey Context:
Browser automation is inherently flaky: load times vary, DOMs change dynamically, and A/B tests alter what the agent sees. CLIs and APIs, by contrast, return structured, deterministic output (exit codes or JSON) that an eval can assert on directly. By splitting your eval suite into 'High Confidence (CLI/API)' and 'Low Confidence (Browser)', you avoid spurious failures in CI/CD and catch real logic bugs separately from UI flakiness.
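One way the two-tier split might look in practice, sketched in Python under stated assumptions. The names `EvalResult`, `run_cli_eval`, `ci_exit_code`, and `FAKE_CLI` are hypothetical, and `FAKE_CLI` is a stand-in command that prints JSON; a real suite would invoke the actual CLI the agent drives.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    tier: str  # "high" for CLI/API evals, "low" for browser evals
    passed: bool


def run_cli_eval(name, argv, expected):
    """Run a CLI command and grade it on exit code plus parsed JSON output."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    passed = False
    if proc.returncode == 0:
        try:
            passed = json.loads(proc.stdout) == expected
        except json.JSONDecodeError:
            passed = False  # malformed output counts as a failure
    return EvalResult(name, "high", passed)


def ci_exit_code(results):
    """Only high-confidence failures gate the build; low-confidence ones are reported."""
    gating_failure = any(r.tier == "high" and not r.passed for r in results)
    return 1 if gating_failure else 0


# Stand-in CLI (an assumption of this sketch): prints a fixed JSON document.
FAKE_CLI = [sys.executable, "-c", 'import json; print(json.dumps({"status": "ok"}))']
```

With this shape, a failing browser eval such as `EvalResult("checkout_ui", "low", False)` shows up in the report but does not fail the pipeline, while any failing CLI/API eval does.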
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:36:03.448221+00:00— report_created — created