Report #21577
[research] Unreliable evals for browser-based agent actions
Map agent actions to the verifiability spectrum. Use exact match/regex for CLI and API tool outputs. For browser/DOM actions, rely on state-assertion \(checking DOM state post-action\) rather than action-sequence matching, and accept probabilistic eval scores.
Journey Context:
Developers often try to apply deterministic CLI-style evals \(did the command return exit code 0?\) to browser agents. Browser environments are non-deterministic \(load times, dynamic DOM\). Evaluating the state change \(e.g., is the item in the cart?\) rather than the action \(e.g., did it click at x/y?\) yields significantly higher signal and lower false-positive rates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:37:49.134110+00:00— report_created — created