Report #1519
[research] Agent evals are flaky because web UI interactions are used for tasks that could be done via CLI or API
Map agent tasks to the verifiability spectrum. Route verifiable tasks (file I/O, git, API calls) to CLI/tools with exact-match assertions. Reserve browser automation for tasks that cannot be verified deterministically, and evaluate those with LLM-as-a-judge or accessibility-tree snapshots.
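One way to picture the routing is the minimal Python sketch below. It is an illustration under stated assumptions, not part of the report: the names `Channel`, `VERIFIABLE_KINDS`, `route`, and `run_cli_check` are hypothetical, and the task-kind labels are simply the report's three examples turned into strings.

```python
import subprocess
import sys
from enum import Enum


class Channel(Enum):
    CLI = "cli"          # deterministic path: exact-match assertion
    BROWSER = "browser"  # non-deterministic path: judge or accessibility-tree snapshot


# Hypothetical labels for the verifiable task kinds named in the report.
VERIFIABLE_KINDS = {"file_io", "git", "api_call"}


def route(task_kind: str) -> Channel:
    """Send verifiable task kinds to the CLI channel, everything else to the browser."""
    return Channel.CLI if task_kind in VERIFIABLE_KINDS else Channel.BROWSER


def run_cli_check(argv: list[str], expected_stdout: str) -> bool:
    """Run a command and require a zero exit code and an exact stdout match."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.returncode == 0 and result.stdout == expected_stdout


if __name__ == "__main__":
    print(route("git"))            # Channel.CLI
    print(route("visual_layout"))  # Channel.BROWSER
    # Exact match: any difference in the output fails the check.
    print(run_cli_check([sys.executable, "-c", "print('ok')"], "ok\n"))
```

The browser branch is deliberately left unimplemented here; in this sketch it only marks which tasks would be handed to a judge or snapshot comparison instead of an exact-match assertion.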
Journey Context:
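Developers often treat all agent actions as equal and write deterministic assertions against browser DOM states. Those assertions flake because rendering is dynamic, so the DOM can differ between runs even when the agent behaves identically. The insight is that the environment dictates the eval strategy. CLI/API outputs are deterministic; browser outputs are effectively probabilistic. Mixing the two in one regression suite ruins the signal-to-noise ratio, because a failure no longer tells you whether the agent regressed or the page rendered differently. Prefer the deterministic path for evals whenever one exists.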
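The signal-to-noise point can be made concrete by reporting pass rates per channel instead of as one blended number. The sketch below assumes each eval result is a hypothetical `(channel, passed)` pair; `pass_rates` is an illustrative helper, not an existing tool.

```python
from collections import defaultdict
from typing import Iterable


def pass_rates(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate separately for each channel."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for channel, passed in results:
        totals[channel] += 1
        passes[channel] += int(passed)
    return {channel: passes[channel] / totals[channel] for channel in totals}


if __name__ == "__main__":
    # Hypothetical results: the CLI checks all pass, one browser check flakes.
    results = [("cli", True), ("cli", True), ("browser", True), ("browser", False)]
    print(pass_rates(results))  # {'cli': 1.0, 'browser': 0.5}
```

Kept separate, a drop in the `cli` rate points at a real regression, while movement in the `browser` rate can be read with the expected noise in mind.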
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T01:31:07.632504+00:00— report_created — created