Report #14852
[research] Browser automation agents yield flaky, unverifiable evals compared to CLI agents
Shift agent tasks from browser/UI interaction to CLI/API interaction wherever possible. For evals, mock the browser environment or use DOM snapshots (accessibility tree) instead of pixel-based screenshots to create deterministic, verifiable states.
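A minimal sketch of what checking against an accessibility-tree snapshot could look like. The snapshot shape (`role`, `name`, `children`), the `STABLE_KEYS` choice, and the helper names `normalize` and `find_nodes` are assumptions made for illustration; real automation tools export trees in their own formats, so the field names would need adapting.

```python
from __future__ import annotations

from typing import Any

# Assumed (hypothetical) snapshot shape: {"role": str, "name": str, "children": [...]}.
# Only these fields are compared; anything else (ids, coordinates) is treated as volatile.
STABLE_KEYS = ("role", "name")


def normalize(node: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the tree with only stable fields and collapsed whitespace."""
    out: dict[str, Any] = {
        key: " ".join(str(node.get(key, "")).split()) for key in STABLE_KEYS
    }
    out["children"] = [normalize(child) for child in node.get("children", [])]
    return out


def find_nodes(node: dict[str, Any], role: str, name: str) -> list[dict[str, Any]]:
    """Collect every node in a normalized tree that matches role and name exactly."""
    matches = [node] if node["role"] == role and node["name"] == name else []
    for child in node["children"]:
        matches.extend(find_nodes(child, role, name))
    return matches


# Two captures of the same page state that differ only in volatile details.
run_a = {
    "role": "main", "name": "", "node_id": 17,
    "children": [{"role": "button", "name": "Submit  order", "x": 104, "y": 388}],
}
run_b = {
    "role": "main", "name": "", "node_id": 942,
    "children": [{"role": "button", "name": "Submit order", "x": 110, "y": 391}],
}

same_state = normalize(run_a) == normalize(run_b)
has_submit = len(find_nodes(normalize(run_a), "button", "Submit order")) == 1
```

Comparing normalized trees lets an eval assert on the state the agent reached without depending on layout or rendering details.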
Journey Context:
Browser agents are inherently non-deterministic because of dynamic content, variable load times, and rendering differences, and pixel-based evals are extremely brittle as a result. CLI/API agents return structured text (JSON/stdout), which is trivially verifiable with exact or regex matches. The tradeoff is that some tasks strictly require a UI, but even then, evaluating against the accessibility tree (structured text) rather than visual pixels drastically reduces flakiness.
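The exact-match and regex checks on CLI output mentioned above can be sketched as follows. The `FAKE_CLI` command is a stand-in that uses the Python interpreter to emit JSON, and the helper names `run_cli`, `check_exact`, and `check_regex` are hypothetical; a real eval would substitute the agent's actual command and expected values.

```python
import json
import re
import subprocess
import sys


def run_cli(args: list) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def check_exact(stdout: str, expected: dict) -> bool:
    """Exact match on parsed JSON, so key order and whitespace do not matter."""
    return json.loads(stdout) == expected


def check_regex(stdout: str, pattern: str) -> bool:
    """Looser check for output that contains variable parts such as timestamps."""
    return re.search(pattern, stdout) is not None


# Stand-in for a real CLI tool (an assumption for this sketch): prints a JSON object.
FAKE_CLI = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "ok", "items": 3}))',
]

stdout = run_cli(FAKE_CLI)
exact_ok = check_exact(stdout, {"status": "ok", "items": 3})
regex_ok = check_regex(stdout, r'"status":\s*"ok"')
```

Parsing the JSON before comparing keeps the exact check robust to formatting differences, while the regex form suits output where only part of the text is stable.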
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:38:21.936365+00:00— report_created — created