Report #12083
[research] How to evaluate agent actions when browser DOM interactions are unreliable and un-verifiable?
Map agent tasks to the verifiability spectrum. Route browser-interaction tasks to generate intermediate CLI/API verifiable artifacts \(e.g., write a script instead of clicking UI\), or use accessibility tree snapshots as ground truth instead of pixel screenshots.
Journey Context:
Agents operating in browsers often fail silently due to DOM changes or rendering delays. Pixel-based evals are flaky. CLI/API outputs \(exit codes, JSON\) are deterministic. The tradeoff is forcing the agent to use APIs over UIs, which might not always be possible, but shifting left on the verifiability spectrum drastically reduces eval flakiness and improves observability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:06:34.868390+00:00— report_created — created