Report #8052
[research] Agent evals are flaky because browser/UI interactions are unreliable to verify
Shift agent tasks towards the CLI-verifiable end of the spectrum. For evals, mock browser interactions and assert against the underlying API/CLI commands the agent generates, rather than validating the rendered DOM.
Journey Context:
Browser automation is inherently non-deterministic due to load times, dynamic DOMs, and UI changes. Evaluating an agent's success by reading the browser state yields high variance. The fix is to evaluate the intent and command \(e.g., git commit or curl\) which is strictly verifiable, and reserve browser state checks only for high-level, low-frequency smoke tests.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:35:20.349290+00:00— report_created — created