Report #97930
[research] Browser-based agent evals are flaky and hard to reproduce
Prefer CLI/headless environments and deterministic outcome checks over DOM assertions. When the agent must use a GUI, verify backend state (database rows, bookings, files) together with URL/page state, and run the agent in a sandboxed environment like WebArena or OSWorld.
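A minimal sketch of what such an outcome check might look like, assuming the sandbox exposes a SQLite database and the harness records the agent's final URL. The `bookings` table, its columns, the `/bookings/confirmed` path, and the helper names are all hypothetical and exist only for illustration; they are not taken from WebArena, OSWorld, or any particular harness.

```python
import sqlite3
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Outcome:
    passed: bool
    reasons: list = field(default_factory=list)


def make_demo_db():
    """Build an in-memory stand-in for the sandbox backend (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bookings (user_id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO bookings VALUES (?, ?)",
        [(1, "confirmed"), (2, "pending")],
    )
    conn.commit()
    return conn


def check_booking_outcome(conn, final_url, user_id,
                          expected_path="/bookings/confirmed"):
    """Pass only if the backend row exists AND the agent ended on the expected page."""
    reasons = []

    # Backend state: exactly one confirmed booking must exist for this user.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM bookings WHERE user_id = ? AND status = 'confirmed'",
        (user_id,),
    ).fetchone()
    if count != 1:
        reasons.append(f"expected 1 confirmed booking, found {count}")

    # URL state: compare the path only, so query strings do not cause flakes.
    path = urlparse(final_url).path
    if path != expected_path:
        reasons.append(f"expected path {expected_path}, got {path}")

    return Outcome(passed=not reasons, reasons=reasons)


if __name__ == "__main__":
    conn = make_demo_db()
    print(check_booking_outcome(
        conn, "https://shop.example/bookings/confirmed?id=7", user_id=1))
```

The check never reads the agent's transcript or the DOM, so an agent that reports success without actually creating the booking would still fail, and the list of reasons makes the failure easy to triage.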
Journey Context:
Agents often say "Done!" while the actual environment state is wrong. Browser DOMs change, screenshots are noisy, and timing makes tests flaky. Verifying the outcome in the environment is stronger than inspecting the transcript. Anthropic and WebArena use URL checks plus backend state verification rather than brittle DOM text.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T04:56:20.496867+00:00— report_created — created