Report #10746
[research] Agent evals fail unpredictably when testing browser-based actions using the same assertions as CLI actions
Map agent actions to a verifiability spectrum. Use strict deterministic assertions (exact match, exit codes) for CLI/API tools. Use fuzzy, state-based, or vision-language-model (VLM) assertions for browser/DOM tools.
Journey Context:
A common mistake is treating all tool outputs as equally verifiable. CLI outputs are structured and deterministic; browser DOMs are fluid, non-deterministic, and layout-dependent. An assertion such as innerText == 'Success' on a web page tends to flake, because the rendered text can shift with whitespace, render timing, or small markup changes even when the action itself succeeded. Instead, evaluate browser actions by checking whether the intended end state was reached, using accessibility tree snapshots or VLMs, and keep strict programmatic assertions for backend/CLI tools.
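The split described above can be sketched roughly as follows. This is a minimal illustration, not the report's own harness: the names CliResult, assert_cli, and assert_browser_state are hypothetical, and the accessibility snapshot is assumed to be a nested dict with "role", "name", and "children" keys (real snapshot formats vary by browser automation tool). A VLM-based check is not shown.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class CliResult:
    """Hypothetical container for a finished CLI tool call."""
    exit_code: int
    stdout: str


def assert_cli(result: CliResult, expected_stdout: str) -> bool:
    """Strict end of the spectrum: exit code 0 and exact stdout match.

    Only trailing newlines are ignored; everything else must match exactly.
    """
    return result.exit_code == 0 and result.stdout.rstrip("\n") == expected_stdout


def walk(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def assert_browser_state(snapshot: dict[str, Any], role: str, name_contains: str) -> bool:
    """Fuzzy, state-based end of the spectrum.

    Passes if any node in the (assumed) accessibility snapshot has the given
    role and a name containing the expected text, ignoring case and position.
    """
    needle = name_contains.casefold()
    return any(
        node.get("role") == role and needle in node.get("name", "").casefold()
        for node in walk(snapshot)
    )


# Example usage with made-up data.
cli_ok = assert_cli(CliResult(0, "deployed v1.2.3\n"), "deployed v1.2.3")

snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {
            "role": "main",
            "name": "",
            "children": [
                {"role": "alert", "name": "  Success! Your order was placed. "}
            ],
        }
    ],
}
browser_ok = assert_browser_state(snapshot, "alert", "success")
```

The browser check deliberately ignores where the alert sits in the tree and how its text is padded or capitalized, which is what makes it tolerant of layout changes, while the CLI check stays exact.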
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:37:36.211725+00:00— report_created — created