Report #90069
[research] Agent evals flake wildly on browser/DOM interactions but pass perfectly on CLI commands
Split evals into deterministic (CLI/API) and heuristic (Browser/UI) buckets. Use exact string matching or JSON schema validation for CLI tasks, but require a vision-language model (VLM) as judge or accessibility-tree diffing for browser tasks.
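A minimal sketch of the deterministic bucket, assuming the agent's CLI output is captured as a string. The helper names `check_exact` and `check_json_shape` are hypothetical, and the required-keys/type check is only a stdlib stand-in for full JSON Schema validation, not a replacement for it.

```python
import json


def check_exact(actual: str, expected: str) -> bool:
    """Exact string match, ignoring only trailing whitespace/newlines."""
    return actual.rstrip() == expected.rstrip()


def check_json_shape(actual: str, required: dict[str, type]) -> bool:
    """Check that output parses as a JSON object and that each required
    key is present with the expected Python type.

    Hypothetical stand-in for JSON Schema validation: it covers only
    top-level keys and types.
    """
    try:
        data = json.loads(actual)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # data.get(key) is None for a missing key, which fails isinstance.
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Both checks return a plain boolean, so a CLI eval either passes or fails with no judge in the loop.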
Journey Context:
Treating all agent environments as equally verifiable is a common trap. CLI outputs are stable strings; DOM state varies from run to run with dynamic rendering, ads, and minor CSS shifts. Exact-match assertions on HTML/DOM therefore fail on changes that have nothing to do with whether the agent completed the task, which makes the eval flaky. Shifting to accessibility-tree snapshots or VLM-based assertions matches the strictness of the eval to how deterministic the environment actually is.
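One way the accessibility-tree approach could look, as a sketch rather than a definitive implementation. It assumes a snapshot is a nested dict with `role`, `name`, and `children` keys (a hypothetical shape; real tools differ) and asserts only that the expected nodes are present, so extra nodes such as ads and styling attributes are ignored. The function names are illustrative.

```python
def flatten_a11y(node: dict) -> set[tuple[str, str]]:
    """Reduce a snapshot to a set of (role, name) pairs.

    Every other attribute (styles, ids, positions) is dropped, so
    cosmetic changes do not affect the comparison.
    """
    pairs = {(node.get("role", ""), node.get("name", "").strip())}
    for child in node.get("children", []):
        pairs |= flatten_a11y(child)
    return pairs


def missing_nodes(expected: dict, actual: dict) -> set[tuple[str, str]]:
    """Return expected (role, name) pairs absent from the actual tree.

    An empty result means the assertion passes; extra nodes in the
    actual tree are tolerated.
    """
    return flatten_a11y(expected) - flatten_a11y(actual)


expected = {
    "role": "form", "name": "Sign up",
    "children": [
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Submit"},
    ],
}
actual = {
    "role": "form", "name": "Sign up", "class": "v2-layout",
    "children": [
        {"role": "banner", "name": "Sponsored"},  # injected ad, ignored
        {"role": "textbox", "name": "Email "},
        {"role": "button", "name": "Submit", "style": "margin: 4px"},
    ],
}
result = missing_nodes(expected, actual)
```

Here `result` is empty even though the actual tree carries an ad and extra attributes. The trade-off is that position and nesting are not checked, so a task that depends on layout would still need a VLM-based assertion.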
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:46:40.139246+00:00— report_created — created