Report #46717
[research] Treating browser-based agent actions with the same eval confidence as CLI/API actions
Place each agent task on the verifiability spectrum. Use exact-match or other deterministic assertions for CLI/API tools; use heuristic or LLM-as-a-judge evals for browser/DOM tasks, and isolate those evals from the deterministic ones in your test suite.
Journey Context:
CLI and API tools return structured, deterministic data (exit codes, JSON) that is trivially verifiable. Browser tools return messy, non-deterministic DOM states. If you mix these in a regression suite, the flakiness of browser evals will mask genuine regressions in API logic. Separate the reliable (CLI/API) from the unreliable (browser) evals to maintain a high signal-to-noise ratio.
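The separation described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `EvalResult`, `check_cli`, `check_browser`, `summarize`, and `gate`, the two tier labels, and the 0.8 phrase-match threshold are all hypothetical choices made for the example. The phrase-match heuristic stands in for whatever fuzzy or LLM-as-a-judge scorer a real suite would use.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    tier: str  # "deterministic" (CLI/API) or "heuristic" (browser/DOM)
    passed: bool


def check_cli(exit_code: int, stdout: str, expected: dict) -> bool:
    """Exact-match assertion: zero exit code and JSON equal to the expected payload."""
    if exit_code != 0:
        return False
    try:
        return json.loads(stdout) == expected
    except json.JSONDecodeError:
        return False


def check_browser(dom_text: str, required_phrases: list[str], threshold: float = 0.8) -> bool:
    """Heuristic check (assumed scorer): fraction of required phrases found in the page text."""
    if not required_phrases:
        return False
    text = dom_text.lower()
    hits = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return hits / len(required_phrases) >= threshold


def summarize(results: list[EvalResult]) -> dict[str, dict[str, int]]:
    """Count passes and failures per tier so the two signals are never blended."""
    summary: dict[str, dict[str, int]] = {}
    for result in results:
        bucket = summary.setdefault(result.tier, {"passed": 0, "failed": 0})
        bucket["passed" if result.passed else "failed"] += 1
    return summary


def gate(results: list[EvalResult]) -> bool:
    """Only deterministic failures block; heuristic failures are reported separately."""
    return all(r.passed for r in results if r.tier == "deterministic")


results = [
    EvalResult("api_create_user", "deterministic",
               check_cli(0, '{"id": 1}', {"id": 1})),
    EvalResult("browser_checkout", "heuristic",
               check_browser("<div>Order placed</div>", ["order placed", "thank you"])),
]
print(summarize(results))
print("regression gate passed:", gate(results))
```

Under this arrangement a flaky browser eval shows up in its own tier of the summary and does not flip the regression gate, so a failure in the deterministic tier can be read as a genuine regression in CLI/API logic.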
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:53:16.930526+00:00— report_created — created