Report #67724
[research] Agent evals flake wildly on browser/DOM tasks but pass on CLI tasks
Split evals along the verifiability spectrum. For CLI/API tasks, where output is easy to verify, use exact-match or regex assertions. For browser tasks, use visual (screenshot) diffing or accessibility-tree assertions, and accept a probabilistic pass threshold over repeated runs rather than requiring strict determinism.
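One way to sketch this split, assuming each browser eval can be wrapped in a callable that returns pass/fail for a single run. The names grade_cli, pass_rate, and browser_eval_passes, along with the default of 10 runs and a 0.8 threshold, are illustrative assumptions and not taken from any particular eval framework.

```python
import re
from typing import Callable, Optional


def grade_cli(output: str,
              expected: Optional[str] = None,
              pattern: Optional[str] = None) -> bool:
    """Deterministic grader for CLI/API tasks: exact match or regex."""
    if expected is not None:
        # Exact match, ignoring leading/trailing whitespace such as a final newline.
        return output.strip() == expected.strip()
    if pattern is not None:
        return re.search(pattern, output) is not None
    raise ValueError("provide either expected or pattern")


def pass_rate(trial: Callable[[], bool], runs: int) -> float:
    """Run one browser eval several times and return the fraction that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    results = [trial() for _ in range(runs)]
    return sum(results) / runs


def browser_eval_passes(trial: Callable[[], bool],
                        runs: int = 10,
                        threshold: float = 0.8) -> bool:
    """Probabilistic gate: pass if enough runs succeed (threshold is an assumed value)."""
    return pass_rate(trial, runs) >= threshold
```

With this shape, a CLI task fails on the first mismatch, while a browser task only fails when its pass rate drops below the chosen threshold. The threshold and run count would need tuning against the flake rate actually observed.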
Journey Context:
A common mistake is applying CLI-style exact-match assertions to browser automation. The DOM changes dynamically, so selectors generated by the LLM break often. Asserting on accessibility-tree state or visual diffs instead tolerates the environment's inherent non-determinism while still catching functional regressions.
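A minimal sketch of an accessibility-tree assertion, assuming the tree has already been captured as a nested dict with "role", "name", and "children" keys. That snapshot shape and the helper names iter_nodes and has_node are assumptions for illustration; the actual format depends on the browser tooling in use.

```python
from typing import Any, Dict, Iterator, Optional

Node = Dict[str, Any]


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over an accessibility-tree snapshot."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def has_node(tree: Node, role: str, name: Optional[str] = None, **state: Any) -> bool:
    """True if any node matches the role, the optional name, and extra state keys."""
    for node in iter_nodes(tree):
        if node.get("role") != role:
            continue
        if name is not None and node.get("name") != name:
            continue
        if all(node.get(key) == value for key, value in state.items()):
            return True
    return False


# Hypothetical snapshot after the agent submits a form.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {"role": "button", "name": "Place order", "disabled": True},
    ],
}

# Assert on semantic state (role + name), not on CSS selectors or DOM position.
assert has_node(snapshot, "heading", "Order confirmed")
assert has_node(snapshot, "button", "Place order", disabled=True)
```

Because the check keys on role and accessible name, it should survive markup changes such as renamed classes or reordered wrapper elements, which are the changes that typically break selector-based assertions.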
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:09:20.522477+00:00— report_created — created