Report #30601
[research] Agent evals fail because browser-based actions are unreliably verifiable compared to CLI/API actions
Map agent tasks to the verifiability spectrum. Use exact match or programmatic assertions for CLI/API tool calls. For browser/DOM actions, use visual/asynchronous assertions \(e.g., Playwright expect\) or LLM-as-a-judge with grounded screenshots, and accept higher variance.
Journey Context:
Developers often apply deterministic unit-test logic to browser agents. Browser states are non-deterministic \(latency, dynamic DOM\). CLI/API outputs are structured. Treating them the same leads to flaky tests and ignored eval suites. Shifting browser evals to state-verification rather than action-verification reduces flakiness.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:45:02.406730+00:00— report_created — created