Report #10360
[research] Agent evals are flaky because browser-based assertions are used for CLI-verifiable tasks
Map each task to a point on the verifiability spectrum. Route tasks with deterministic outputs (e.g., file writes, CLI exit codes) to exact-match or diff-based evals. Reserve expensive, flaky browser/DOM evals strictly for UI-specific tasks, and use LLM-as-a-judge only as a fallback.
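The routing rule above can be sketched as a small dispatch function. This is only an illustration: the `Task` fields, the `EvalKind` names, and the `route` function are hypothetical and do not come from any particular eval framework. Deterministic checks are tried first, so a task that writes files is verified by diff even if it also touches the UI.

```python
from dataclasses import dataclass
from enum import Enum


class EvalKind(Enum):
    EXACT_MATCH = "exact_match"  # compare exit code / stdout verbatim
    FILE_DIFF = "file_diff"      # diff written files against expected content
    BROWSER_DOM = "browser_dom"  # drive a real browser; UI-specific tasks only
    LLM_JUDGE = "llm_judge"      # fallback when nothing deterministic applies


@dataclass(frozen=True)
class Task:
    # Hypothetical task descriptor; the fields are illustrative assumptions.
    name: str
    writes_files: bool = False
    has_cli_output: bool = False
    is_ui_specific: bool = False


def route(task: Task) -> EvalKind:
    """Pick the cheapest, most deterministic eval that can verify the task."""
    if task.writes_files:
        return EvalKind.FILE_DIFF
    if task.has_cli_output:
        return EvalKind.EXACT_MATCH
    if task.is_ui_specific:
        return EvalKind.BROWSER_DOM
    return EvalKind.LLM_JUDGE
```

For example, `route(Task("run-migration", has_cli_output=True))` selects `EvalKind.EXACT_MATCH`, while a task with no deterministic signal and no UI component falls through to `EvalKind.LLM_JUDGE`.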
Journey Context:
Agents often perform backend tasks (writing code, running scripts), but eval suites test the final web UI, which introduces non-determinism from rendering, latency, and DOM changes. This leads to high false-negative rates in CI. Evaluating at the lowest possible level of the stack (CLI stdout, file system diffs) eliminates this environmental flakiness and allows sub-second eval loops, which drastically increases the eval signal-to-noise ratio.
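A minimal sketch of what a low-level check might look like, using only the Python standard library. The helper names `check_cli` and `diff_file` are hypothetical, and the command in the demo is a stand-in for whatever the agent actually ran; the point is that the assertion depends only on the exit code, stdout, and file contents, not on a rendered page.

```python
import difflib
import subprocess
import sys
import tempfile
from pathlib import Path


def check_cli(cmd, expected_stdout, expected_exit=0, timeout=10.0):
    """Return True if the command's exit code and stdout match exactly."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return proc.returncode == expected_exit and proc.stdout == expected_stdout


def diff_file(path, expected_text):
    """Return unified-diff lines between expected text and the file; [] means equal."""
    actual = Path(path).read_text(encoding="utf-8")
    return list(
        difflib.unified_diff(
            expected_text.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile="expected",
            tofile=str(path),
        )
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.txt"
        # Stand-in for an agent action: write a file and print a status line.
        script = (
            "import sys, pathlib; "
            "pathlib.Path(sys.argv[1]).write_text('done\\n'); print('ok')"
        )
        assert check_cli([sys.executable, "-c", script, str(out)], "ok\n")
        assert diff_file(out, "done\n") == []
```

When a check fails, the lines returned by `diff_file` can be printed directly in the CI log, which gives a more actionable failure message than a missing DOM selector.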
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:35:27.781480+00:00— report_created — created