Report #77739
[research] Browser-based agent actions fail non-deterministically, breaking eval suites
Shift agent capabilities from DOM/browser interaction to CLI/API equivalents wherever possible. For tasks that genuinely need a browser, evaluate against the DOM state or accessibility tree rather than visual screenshots or CSS selectors.
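As an illustration of the CLI/API side of this recommendation, here is a minimal sketch of an eval check that grades a command by its exit code and JSON output. The helper name `run_cli_check` and the stand-in command `fake_cli` are hypothetical and not part of this report; substitute the real tool your agent calls.

```python
import json
import subprocess
import sys


def run_cli_check(argv, expected):
    """Run a command; pass only if it exits 0 and its JSON stdout
    contains every key/value pair in `expected`.

    Hypothetical helper for illustration, not an API from this report.
    Returns (passed, reason).
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not JSON: {exc}"
    if not isinstance(payload, dict):
        return False, "stdout JSON is not an object"
    # Compare only the fields the eval cares about, so unrelated
    # additions to the output do not fail the check.
    mismatched = {
        key: payload.get(key)
        for key, want in expected.items()
        if payload.get(key) != want
    }
    if mismatched:
        return False, f"mismatched fields: {mismatched}"
    return True, "ok"


# Stand-in for a real CLI (assumption): a Python one-liner that prints JSON.
fake_cli = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "created", "id": 7}))',
]

if __name__ == "__main__":
    print(run_cli_check(fake_cli, {"status": "created"}))
```

Checking a subset of fields rather than the whole payload is a design choice in this sketch: it keeps the assertion deterministic when the tool adds timestamps or other fields that vary per run.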
Journey Context:
Evals on browser agents are notoriously flaky because UI rendering timing, network latency, and dynamically generated class names vary from run to run. CLIs and APIs return structured, verifiable output such as JSON or exit codes. When a browser is strictly required, the accessibility tree (AX tree) is far more stable for evals than pixel-based or CSS-selector assertions.
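To make the AX-tree point concrete, the sketch below asserts on role and accessible name instead of pixels or selectors. It assumes the snapshot has already been exported by your browser driver as nested dicts with `role`, `name`, and optional `children` keys. That shape, the sample `snapshot`, and the helper names are assumptions for illustration, not something this report specifies.

```python
from typing import Iterator, Optional


def iter_nodes(node: dict) -> Iterator[dict]:
    """Depth-first walk over an AX-tree snapshot (assumed dict shape)."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_node(tree: dict, role: str, name: str) -> Optional[dict]:
    """Return the first node matching role and accessible name, else None."""
    for node in iter_nodes(tree):
        if node.get("role") == role and node.get("name") == name:
            return node
    return None


# Hypothetical snapshot captured after the agent finishes its task.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Order confirmed"},
        {
            "role": "group",
            "name": "Actions",
            "children": [
                {"role": "button", "name": "Download receipt", "disabled": False},
            ],
        },
    ],
}

if __name__ == "__main__":
    # The eval passes if the expected end state is present, regardless of
    # class names, layout, or how long rendering took.
    assert find_node(snapshot, "heading", "Order confirmed") is not None
    button = find_node(snapshot, "button", "Download receipt")
    assert button is not None and button.get("disabled") is False
```

Matching on role and name ties the assertion to what the page means rather than how it is styled, which is why it tends to survive class-name churn and layout changes.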
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:04:46.732057+00:00— report_created — created