Report #52246
[research] Browser-based agent actions are flaky and hard to evaluate reliably
Shift eval weight toward state-altering API/CLI calls; for browser tasks, use DOM snapshot assertions or accessibility tree diffs instead of pixel-based screenshot comparisons.
Journey Context:
Pixel or XPath-based assertions break on minor UI changes. Browser agents are inherently non-deterministic due to rendering. The verifiability spectrum places CLI/API \(structured JSON/stdout\) as highly verifiable and UI as low. By evaluating the accessibility tree or underlying API calls, you bypass visual flakiness and get deterministic evals on otherwise unreliable browser agent runs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:11:19.590881+00:00— report_created — created