Report #81994
[research] Applying deterministic regression evals to browser-based agent actions
Split evals along the verifiability spectrum: use exact-match or state-diff assertions for CLI/API agents, and accessibility-tree snapshots or an LLM-as-a-judge for browser agents.
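A minimal sketch of the exact-match side of this split, assuming the agent's CLI action can be replayed as an argv list and that it prints JSON to stdout. The helper names (`run_cli`, `check_cli_case`) and the demo command are hypothetical and stand in for whatever the agent actually ran.

```python
import json
import subprocess
import sys


def run_cli(argv):
    """Run a command and return (exit_code, stdout)."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return proc.returncode, proc.stdout


def check_cli_case(argv, expected_code, expected_json):
    """Exact-match eval: the exit code and the parsed JSON stdout must both match."""
    code, out = run_cli(argv)
    if code != expected_code:
        return False
    try:
        return json.loads(out) == expected_json
    except json.JSONDecodeError:
        # Unparseable stdout counts as a failed case, not a crashed eval.
        return False


# Hypothetical stand-in for an agent-issued command.
demo_argv = [sys.executable, "-c", "import json; print(json.dumps({'created': 1}))"]
cli_ok = check_cli_case(demo_argv, 0, {"created": 1})
```

Comparing parsed JSON rather than raw stdout keeps the assertion exact while ignoring whitespace and key order, which is usually the intent for structured CLI output.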
Journey Context:
CLI commands return exit codes and structured stdout, which makes them highly verifiable. Browser output is noisy: a one-pixel rendering difference breaks a screenshot comparison, and a dynamically generated class name breaks an exact-match DOM assertion. Teams often waste time writing brittle CSS-selector assertions for web agents. Comparing accessibility-tree state, or falling back to LLM- or VLM-based judgment, accepts the inherent non-determinism of the browser environment instead of fighting it.
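One way to sketch the accessibility-tree comparison, assuming a snapshot is available as nested dicts with `role`, `name`, and `children` keys. That shape, the `STABLE_KEYS` choice, and the sample trees are all assumptions for illustration; a real snapshot would come from whatever your browser driver exposes.

```python
# Semantic fields worth asserting on; everything else is treated as noise.
STABLE_KEYS = ("role", "name", "checked", "disabled")


def normalize(node):
    """Keep only semantic fields; drop volatile ones such as class names or bounds."""
    out = {key: node[key] for key in STABLE_KEYS if key in node}
    children = [normalize(child) for child in node.get("children", [])]
    if children:
        out["children"] = children
    return out


def same_state(actual, expected):
    """True when two snapshots agree after volatile fields are stripped."""
    return normalize(actual) == normalize(expected)


# Hypothetical snapshots of the same page state across two runs.
baseline = {
    "role": "form", "name": "Checkout", "class": "css-1x9a",
    "children": [
        {"role": "checkbox", "name": "Gift wrap", "checked": True, "class": "css-7fq2"},
        {"role": "button", "name": "Place order", "bounds": [10, 200, 120, 40]},
    ],
}
rerun = {
    "role": "form", "name": "Checkout", "class": "css-k3m0",
    "children": [
        {"role": "checkbox", "name": "Gift wrap", "checked": True, "class": "css-b81z"},
        {"role": "button", "name": "Place order", "bounds": [10, 201, 120, 40]},
    ],
}
regressed = {
    "role": "form", "name": "Checkout",
    "children": [
        {"role": "checkbox", "name": "Gift wrap", "checked": False},
        {"role": "button", "name": "Place order"},
    ],
}

stable_match = same_state(baseline, rerun)        # class and bounds noise is ignored
caught_change = not same_state(baseline, regressed)  # unchecked box is a real diff
```

The comparison stays deterministic because it only sees roles, names, and states; cases that still cannot be expressed this way are the ones left for a model-based judge.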
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:13:16.329781+00:00— report_created — created