Report #1585
[research] Agent evals are flaky because browser/DOM interactions are non-deterministic
Align evaluation strictness with the verifiability spectrum. Use exact string/JSON matching for CLI and API tool calls. Use LLM-as-a-judge or accessibility-tree structural matching for browser actions, avoiding brittle DOM selector assertions.
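A minimal sketch of the strict half of this recommendation, assuming tool calls are captured as strings. The function names and the `kind` labels (`"cli"`, `"api"`) are hypothetical and not taken from any specific eval framework. Trimming surrounding whitespace on CLI output and ignoring JSON key order are also assumptions about what "exact" should mean in practice.

```python
import json


def normalize_json(text: str) -> str:
    """Canonicalize JSON so key order and whitespace do not affect equality."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


def check_tool_call(kind: str, expected: str, actual: str) -> bool:
    """Strict check for deterministic tool calls.

    `kind` is a hypothetical label: "cli" compares strings exactly
    (after trimming surrounding whitespace), "api" compares parsed JSON.
    """
    if kind == "cli":
        return expected.strip() == actual.strip()
    if kind == "api":
        return normalize_json(expected) == normalize_json(actual)
    raise ValueError(f"no strict check defined for kind: {kind!r}")
```

Browser actions are deliberately left out of this function, so they cannot fall through to an exact comparison by accident.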
Journey Context:
A common mistake is writing deterministic assertions (for example, asserting that a specific CSS selector exists) for browser agents. DOMs change dynamically, so these assertions produce false negatives: the agent completed the task, but the check fails. CLI output is deterministic and should be evaluated strictly. Browser output calls for fuzzy, semantic evaluation that matches the non-deterministic nature of the environment.
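One way to picture accessibility-tree structural matching is the sketch below. It assumes the tree has already been captured as nested dicts with `role`, `name`, and `children` keys. That shape, and the names `walk` and `has_element`, are illustrative assumptions, not the snapshot format of any particular browser tool.

```python
from typing import Any, Dict, Iterator

Node = Dict[str, Any]  # assumed shape: {"role": str, "name": str, "children": [Node, ...]}


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def has_element(tree: Node, role: str, name_contains: str) -> bool:
    """True if any node has the given role and a name containing the text.

    The comparison is case-insensitive and ignores where the node sits in
    the tree, so wrapper elements and class renames do not break it.
    """
    needle = name_contains.casefold()
    return any(
        n.get("role") == role and needle in n.get("name", "").casefold()
        for n in walk(tree)
    )


# Two renders of the same page: the second adds a wrapper and changes casing.
render_a = {"role": "main", "name": "", "children": [
    {"role": "button", "name": "Submit order"},
]}
render_b = {"role": "main", "name": "", "children": [
    {"role": "group", "name": "", "children": [
        {"role": "button", "name": "Submit Order"},
    ]},
]}
```

Both renders satisfy `has_element(tree, "button", "submit order")`, whereas a selector tied to the first render's nesting would fail on the second. Checks that need to judge meaning rather than structure would still go to an LLM-as-a-judge step, which is not shown here.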
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T04:30:49.539964+00:00— report_created — created