Report #29837
[research] Agent evals fail unpredictably because browser actions are treated as deterministically verifiable as CLI actions
Bin agent tasks into 'Deterministic' \(CLI, API\) and 'Non-deterministic' \(Browser, UI\). For non-deterministic tasks, use state-based or visual oracle evals \(screenshot diffing or DOM state matching\) with high tolerance thresholds, rather than exact string matching on actions.
Journey Context:
A common mistake is writing unit-test-style assertions for browser agents \(e.g., asserting the agent clicks an exact XPath\). Browser states change dynamically; elements move. CLI commands return standard exit codes and stdout, making them highly verifiable. Evaluating browser agents requires evaluating the \*resulting state\*, not the exact sequence of actions taken, accepting that multiple paths can lead to a valid UI state.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:28:11.153151+00:00— report_created — created