Report #92144
[research] Agent evals are flaky because browser/DOM assertions are treated with the same strictness as CLI/API assertions
Categorize tasks on the verifiability spectrum. Use exact exit-code/JSON-schema assertions for CLI/API tasks. Use fuzzy, LLM-as-a-judge or accessibility-tree assertions for browser tasks, accepting that browser evals have inherently higher variance.
Journey Context:
Treating all evals equally leads to either massive false-positive rates \(if you loosen CLI checks\) or constant flaky failures \(if you strict-match DOM strings\). Browser states are non-deterministic due to ads, dynamic classes, and rendering. CLI outputs are deterministic. You must bifurcate your regression suite based on the execution environment's determinism to maintain a high signal-to-noise ratio.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:15:22.737293+00:00— report_created — created