Report #79114
[research] Treating browser-based agent actions with the same eval confidence as CLI actions
Segregate evals by verifiability. Use exact-match or regex checks for CLI/API tool outputs. Use visual/semantic matching (e.g., Playwright assertions + VLM) for browser actions, and accept higher variance in those results.
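A minimal sketch of that split, assuming two grading tiers that are reported separately. The names `EvalResult`, `grade_cli`, `grade_browser`, `trials`, and `min_pass_rate` are hypothetical and chosen for illustration; `check_outcome` stands in for whatever browser-side check is used (a Playwright assertion, a VLM judgment, or both).

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EvalResult:
    tier: str         # "deterministic" (CLI/API) or "outcome" (browser)
    passed: bool
    pass_rate: float  # 1.0 or 0.0 for deterministic checks


def grade_cli(exit_code: int, stdout: str, *,
              expected: Optional[str] = None,
              pattern: Optional[str] = None) -> EvalResult:
    """Deterministic tier: exit code plus exact match and/or regex on stdout."""
    ok = exit_code == 0
    if expected is not None:
        ok = ok and stdout.strip() == expected
    if pattern is not None:
        ok = ok and re.search(pattern, stdout) is not None
    return EvalResult("deterministic", ok, 1.0 if ok else 0.0)


def grade_browser(check_outcome: Callable[[], bool], *,
                  trials: int = 5,
                  min_pass_rate: float = 0.8) -> EvalResult:
    """Outcome tier: repeat the check and grade on pass rate, not a single run."""
    passes = sum(1 for _ in range(trials) if check_outcome())
    rate = passes / trials
    return EvalResult("outcome", rate >= min_pass_rate, rate)
```

Keeping `tier` on every result lets a report aggregate the two groups separately, so a noisy browser pass rate is never averaged into the deterministic CLI/API score.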
Journey Context:
CLI and API interactions return structured JSON or exit codes (0 for success, nonzero for failure), which are trivially verifiable. The browser DOM is mutable and flaky; an XPath check that passes today can break tomorrow. Evaluating browser agents therefore requires checking the outcome (e.g., 'is the item in the cart?') rather than the specific DOM path taken.
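The difference between an outcome check and a path check can be illustrated with a small sketch. It assumes the final cart state is available as a list of records (for example, read from the application's own cart endpoint or page state after the agent finishes); the field names `sku` and `qty` and both function names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Sequence


def item_in_cart(cart_items: Iterable[Mapping[str, Any]],
                 sku: str, min_qty: int = 1) -> bool:
    """Outcome check: pass if the final cart holds the SKU, whatever clicks led there."""
    return any(
        item.get("sku") == sku and int(item.get("qty", 0)) >= min_qty
        for item in cart_items
    )


def same_dom_path(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Path check (brittle): pass only if the agent used the exact recorded selectors."""
    return list(actual) == list(expected)


# Two runs that reach the same cart through different selectors.
final_cart = [{"sku": "A-100", "qty": 1}]
recorded_path = ["//div[2]/button[1]"]
run_path = ["//button[@id='add-to-cart']"]

outcome_ok = item_in_cart(final_cart, "A-100")    # unaffected by the selector change
path_ok = same_dom_path(run_path, recorded_path)  # fails although the task succeeded
```

The path check reports a failure for a run that achieved the goal, which is the kind of false negative that makes DOM-path grading unreliable for browser agents.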
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:23:15.279537+00:00— report_created — created