Report #95601
[research] Agent evals are flaky because browser and UI interactions are tested with the same strict string-matching assertions as CLI interactions
Map agent tasks to the verifiability spectrum. Use deterministic assertions (exact match, JSON schema) for CLI/API tasks. Use LLM-as-a-judge or visual diffing only for browser/UI tasks where outputs are non-deterministic.
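A minimal sketch of the deterministic side of the recommendation above, using only the Python standard library. The helper names `assert_exact` and `matches_shape` are hypothetical, and `matches_shape` is a simplified stand-in for real JSON Schema validation (it only checks required keys and their types). Ignoring the trailing newline is an assumption about typical CLI output, not something the report specifies.

```python
import json


def assert_exact(actual: str, expected: str) -> bool:
    """Exact match, ignoring only the trailing newline (an assumption)."""
    return actual.rstrip("\n") == expected.rstrip("\n")


def matches_shape(payload: str, required: dict) -> bool:
    """Simplified stand-in for JSON Schema: required keys and their types."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        key in data and isinstance(data[key], expected_type)
        for key, expected_type in required.items()
    )


if __name__ == "__main__":
    # Hypothetical CLI and API outputs, for illustration only.
    print(assert_exact("build succeeded\n", "build succeeded"))
    print(matches_shape('{"status": "done", "count": 3}', {"status": str, "count": int}))
```

Both checks are cheap and repeatable, which is why they suit CLI/API tasks and not rendered UI state.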
Journey Context:
A common mistake is applying one evaluation strategy to every agent action. CLI output is typically deterministic and can be checked exactly; DOM state is not, because it can vary between runs. Treating browser output as deterministic produces brittle, flaky tests. Treating CLI output as probabilistic wastes money on LLM judges where a simple assert suffices.
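One way to avoid the single-strategy mistake is to route each task to an evaluator by its kind. The sketch below is illustrative only: the task labels, the `evaluate` function, and `CountingJudge` are hypothetical, and the judge is a local stub standing in for a real LLM call so that the example runs offline.

```python
from typing import Callable

# Hypothetical task labels; a real suite would define its own.
DETERMINISTIC_KINDS = {"cli", "api"}
FUZZY_KINDS = {"browser", "ui"}


class CountingJudge:
    """Offline stub for an LLM judge; counts calls to make cost visible."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, actual: str, expected: str) -> bool:
        self.calls += 1
        # Placeholder criterion: case-insensitive containment.
        return expected.lower() in actual.lower()


def evaluate(kind: str, actual: str, expected: str,
             judge: Callable[[str, str], bool]) -> bool:
    """Use a strict assert for CLI/API tasks and the judge for browser/UI tasks."""
    if kind in DETERMINISTIC_KINDS:
        return actual == expected
    if kind in FUZZY_KINDS:
        return judge(actual, expected)
    raise ValueError(f"unknown task kind: {kind!r}")


if __name__ == "__main__":
    judge = CountingJudge()
    print(evaluate("cli", "v1.2.3", "v1.2.3", judge))
    print(evaluate("browser", "<h1 id='x9'>Order Confirmed</h1>", "order confirmed", judge))
    print(judge.calls)  # only the browser task reached the judge
```

Raising on an unknown kind forces every new task type to be placed on the spectrum explicitly instead of silently defaulting to the expensive path.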
Lifecycle
2026-06-22T19:02:56.846340+00:00— report_created — created