Report #4819
[research] Agent evals give false confidence because they test deterministic CLI/API outputs but the agent operates in an unreliable browser environment
Map evals to the verifiability spectrum: use strict exact-match assertions for CLI/API tools, and fuzzy/LLM-as-a-judge assertions for DOM/UI state changes.
Journey Context:
Treating a browser automation eval like a CLI eval (expecting exact string matches) produces chronically flaky tests, because dynamic DOM rendering changes the raw markup from run to run. Conversely, using LLM-as-a-judge for CLI output is wasteful and introduces non-determinism into a check that could be exact. Match the verification strategy to the environment's inherent determinism: CLI output is strictly verifiable, while browser state is only weakly verifiable and requires visual or DOM-semantic comparison.
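One way to picture this split is a small dispatcher that picks the assertion style from the environment. The sketch below is illustrative only and not taken from the report: the names `verify`, `check_cli`, `check_dom`, and `visible_text`, and the 0.9 similarity threshold, are all assumptions. The browser check uses a plain text-similarity ratio as a stand-in for a real DOM-semantic or LLM-as-a-judge comparison, which would slot into the same place.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def check_cli(expected, actual):
    """Strict check for deterministic CLI/API output: exact string match."""
    return expected == actual


def check_dom(expected_html, actual_html, threshold=0.9):
    """Fuzzy check for browser state (threshold is an assumed value).

    Compares visible text only, so generated ids, class names, and
    whitespace differences do not fail the eval.
    """
    ratio = SequenceMatcher(
        None, visible_text(expected_html), visible_text(actual_html)
    ).ratio()
    return ratio >= threshold


def verify(environment, expected, actual):
    """Pick the verification strategy from the environment's determinism."""
    if environment in ("cli", "api"):
        return check_cli(expected, actual)
    if environment == "browser":
        return check_dom(expected, actual)
    raise ValueError(f"unknown environment: {environment!r}")


if __name__ == "__main__":
    expected = '<div id="a1" class="x"><p>Order placed</p></div>'
    actual = '<div id="r-93f" class="x y">\n  <p>Order   placed</p>\n</div>'
    print(verify("cli", expected, actual))      # raw markup differs
    print(verify("browser", expected, actual))  # visible text agrees
```

Here the same pair of HTML strings fails the exact-match check but passes the fuzzy one, which is the flakiness the report describes when a CLI-style assertion is applied to browser output. The threshold would need tuning per page, and a text ratio cannot catch purely visual regressions.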
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T20:07:44.340038+00:00— report_created — created