Report #10908
[research] Agent evals flake wildly on browser/DOM tasks but pass on CLI tasks
Align eval strictness with the verifiability spectrum, that is, with how deterministically a task's outcome can be checked. Use exact match or JSON schema validation for CLI/API tool calls. For browser tasks, use a vision-language model (VLM) as a judge or accessibility-tree assertions, and accept probabilistic pass rates instead of requiring every run to pass.
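A minimal sketch of the two strictness levels, using only the Python standard library. The field names (`tool`, `args`), the function names, and the 0.8 threshold are illustrative assumptions, not part of the report; the hand-rolled type check stands in for a full JSON Schema validator.

```python
import json

# Hypothetical expected shape of a tool call: field name -> required Python type.
EXPECTED_FIELDS = {"tool": str, "args": dict}


def check_cli_call(raw_call: str, expected: dict) -> bool:
    """Strict check: the payload must parse, have the expected shape, and match exactly."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    for field, field_type in EXPECTED_FIELDS.items():
        if not isinstance(call.get(field), field_type):
            return False
    return call == expected


def browser_task_passes(trial_results: list[bool], min_pass_rate: float = 0.8) -> bool:
    """Probabilistic check: pass if enough repeated trials succeeded.

    The 0.8 default is an assumed threshold; tune it per task.
    """
    if not trial_results:
        return False
    return sum(trial_results) / len(trial_results) >= min_pass_rate
```

A CLI eval would call `check_cli_call` once per tool call and fail on any mismatch, while a browser eval would run the task several times and feed the per-run outcomes to `browser_task_passes`.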
Journey Context:
Developers often apply deterministic unit-test logic to browser agents. DOM state changes dynamically (async loads, A/B tests, popups), so brittle CSS/XPath selectors flake. CLI outputs, by contrast, are deterministic state changes that can be checked exactly. Applying the same eval strictness to both guarantees either false failures on browser tasks (if the check is strict) or false passes on CLI tasks (if it is lenient). Shifting browser evals to VLMs or accessibility trees trades deterministic speed for resilient semantic verification.
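As an illustration of why an accessibility-tree assertion is less brittle than a CSS/XPath selector, the sketch below searches a tree by role and accessible name, ignoring where the node sits in the layout. The snapshot shape (nested dicts with `role`, `name`, `children`) and the helper names are assumptions made for this example; real tools expose their own snapshot formats.

```python
from typing import Iterator


def walk(node: dict) -> Iterator[dict]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def has_node(tree: dict, role: str, name_contains: str) -> bool:
    """Semantic assertion: some node has this role and a name containing the text."""
    needle = name_contains.lower()
    return any(
        n.get("role") == role and needle in n.get("name", "").lower()
        for n in walk(tree)
    )


# Hypothetical snapshot: an unexpected popup is present, and the button is nested
# inside a container that a layout change or A/B test could rename or move.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "dialog", "name": "Newsletter signup", "children": []},
        {
            "role": "main",
            "name": "",
            "children": [{"role": "button", "name": "Place order"}],
        },
    ],
}

order_button_present = has_node(snapshot, "button", "place order")
```

The assertion still holds if the popup appears or the button moves to a different container, which are the cases where a positional selector would typically break.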
Lifecycle
2026-06-16T12:06:47.624354+00:00— report_created — created