Report #16942
[research] Agent evals fail unpredictably due to non-deterministic environment interactions such as browser DOM changes
Place each agent task on the verifiability spectrum and assert against a mocked environment. For CLI/API tasks, assert exact state diffs. For browser tasks, use accessibility tree snapshots instead of brittle DOM selectors (CSS or XPath), and rely on LLM-as-a-judge only for the final semantic outcome, not intermediate DOM states.
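One way to read "assert exact state diffs" for a CLI task is to snapshot the working directory before and after the agent runs and compare the two snapshots exactly. The sketch below is illustrative only: `snapshot_dir`, `state_diff`, and `run_example` are hypothetical helper names, the file writes stand in for whatever the agent actually does, and it assumes the task's observable state lives entirely in files under one directory.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_dir(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def state_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify every path as added, removed, or modified between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
    }


def run_example() -> dict[str, list[str]]:
    """Run a stand-in 'agent' inside a throwaway directory and return the diff."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "config.ini").write_text("debug=false\n")
        (root / "old.log").write_text("stale\n")
        before = snapshot_dir(root)

        # Stand-in for the agent's CLI actions (hypothetical task).
        (root / "config.ini").write_text("debug=true\n")
        (root / "old.log").unlink()
        (root / "report.txt").write_text("done\n")

        return state_diff(before, snapshot_dir(root))


# The eval passes only if the diff matches exactly: no extra or missing changes.
expected = {"added": ["report.txt"], "removed": ["old.log"], "modified": ["config.ini"]}
assert run_example() == expected
```

Comparing the whole diff, rather than checking that one expected file exists, also catches side effects the task did not ask for.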
Journey Context:
A common mistake is applying the same eval rigor to every agent action. In a fixed environment, CLI commands return deterministic exit codes and stdout, which makes them highly verifiable. Browser interactions are flaky because the DOM changes dynamically, so evals built on DOM selectors break often. Asserting against the accessibility tree instead gives a more stable, text-based representation of the UI and reduces flakiness. Use an LLM judge for the final outcome, but never for intermediate structural DOM assertions.
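The stability argument for accessibility tree snapshots can be sketched without a browser. The example below assumes the harness can export the tree as nested dicts with `role`, `name`, and `children` keys; that shape, and the names `flatten_ax_tree`, `has_node`, and `IGNORED_ROLES`, are assumptions made for illustration, not a specific tool's API. It flattens the tree to text lines and skips purely structural wrapper roles, so a layout refactor that adds a wrapper element leaves the snapshot unchanged.

```python
# ARIA roles that carry no semantics of their own; wrappers using them are skipped.
IGNORED_ROLES = {"generic", "none", "presentation"}


def flatten_ax_tree(node: dict, depth: int = 0) -> list[str]:
    """Render an accessibility tree as indented 'role "name"' lines."""
    role = node.get("role", "")
    name = node.get("name", "")
    lines: list[str] = []
    if role in IGNORED_ROLES:
        child_depth = depth  # hoist children of a wrapper to the wrapper's level
    else:
        lines.append(f'{"  " * depth}{role} "{name}"')
        child_depth = depth + 1
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, child_depth))
    return lines


def has_node(lines: list[str], role: str, name: str) -> bool:
    """Check for a role/name pair anywhere in the snapshot, ignoring depth."""
    return f'{role} "{name}"' in (line.strip() for line in lines)


# Hypothetical page before a front-end refactor.
tree_v1 = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Checkout"},
    {"role": "button", "name": "Place order"},
]}

# Same page after the button was wrapped in an extra layout container.
tree_v2 = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Checkout"},
    {"role": "generic", "name": "", "children": [
        {"role": "button", "name": "Place order"},
    ]},
]}

# The snapshot is unchanged even though a DOM-path selector would have broken.
assert flatten_ax_tree(tree_v1) == flatten_ax_tree(tree_v2)
assert has_node(flatten_ax_tree(tree_v2), "button", "Place order")
```

Intermediate steps can then be checked with plain assertions like these, leaving the LLM judge for the final semantic outcome only.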
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:09:17.347240+00:00— report_created — created