Report #58466
[research] How to evaluate agent actions when browser/UI interactions are unreliable but CLI actions are deterministic?
Map tasks to the verifiability spectrum. Route verifiable tasks (file I/O, CLI) to sandboxed environments with strict exit-code and diff-based assertions. Route less verifiable tasks (browser) to DOM state-snapshot comparisons, falling back to LLM-as-a-judge only when no deterministic check is possible.
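A minimal sketch of the CLI side of this routing, assuming Python and only the standard library. The names `pick_evaluator`, `check_cli_step`, and `CliResult`, and the task-kind labels, are hypothetical and not part of any existing harness. Sandboxing is out of scope here; the command is assumed to already run inside an isolated environment.

```python
import difflib
import subprocess
import sys
from dataclasses import dataclass


def pick_evaluator(task_kind):
    """Map a task kind to an evaluator tier (labels are illustrative)."""
    tiers = {
        "cli": "exit_code_and_diff",
        "file_io": "exit_code_and_diff",
        "browser": "dom_snapshot",
    }
    # Anything without a deterministic check falls back to the judge.
    return tiers.get(task_kind, "llm_judge")


@dataclass
class CliResult:
    passed: bool
    exit_code: int
    diff: str


def check_cli_step(argv, expected_stdout, expected_exit=0, timeout=30):
    """Run a command, then assert on its exit code and a diff of stdout."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    diff = "".join(
        difflib.unified_diff(
            expected_stdout.splitlines(keepends=True),
            proc.stdout.splitlines(keepends=True),
            fromfile="expected",
            tofile="actual",
        )
    )
    # Pass only when both the exit code and the output match exactly.
    passed = proc.returncode == expected_exit and diff == ""
    return CliResult(passed, proc.returncode, diff)


if __name__ == "__main__":
    result = check_cli_step([sys.executable, "-c", "print('ok')"], "ok\n")
    print(result.passed, result.exit_code)
```

Returning the diff alongside the pass/fail flag keeps failures debuggable without a second run. The same pattern could be applied to file outputs by diffing file contents instead of stdout.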
Journey Context:
Agents often interleave CLI and browser actions. Evaluating both with LLM-as-a-judge is expensive and flaky. The key insight is that verifiability is a property of the environment, not the model. By forcing the agent to use CLI/APIs where possible, you move more of the evaluation onto deterministic checks. Browser evals should rely on Playwright-style DOM state snapshots rather than visual or LLM-based checks, which are prone to flakiness and hallucination.
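One way to sketch a DOM state-snapshot comparison, assuming Python and only the standard library. The HTML strings stand in for serialized page snapshots (with Playwright these could come from `page.content()`). The helper names and the list of attributes treated as "state" are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Attributes assumed to carry state; styling attributes such as
# class and style are ignored on purpose because they churn often.
STATE_ATTRS = ("id", "name", "value", "checked", "disabled", "aria-label")


class _StateExtractor(HTMLParser):
    def __init__(self, attrs_to_keep):
        super().__init__(convert_charrefs=True)
        self.keep = frozenset(attrs_to_keep)
        self.state = []

    def handle_starttag(self, tag, attrs):
        # Boolean attributes arrive with value None; normalize to "".
        kept = tuple(sorted((k, v or "") for k, v in attrs if k in self.keep))
        self.state.append(("tag", tag, kept))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.state.append(("text", text))


def dom_state(html, attrs_to_keep=STATE_ATTRS):
    """Reduce an HTML snapshot to an ordered list of tags and text."""
    parser = _StateExtractor(attrs_to_keep)
    parser.feed(html)
    parser.close()
    return parser.state


def dom_matches(actual_html, expected_html):
    """Deterministic check: normalized states must be identical."""
    return dom_state(actual_html) == dom_state(expected_html)


if __name__ == "__main__":
    expected = '<form><input id="q" value="cats"><button disabled>Go</button></form>'
    actual = '<form>\n  <input value="cats" id="q" class="x1">\n  <button disabled="">Go</button>\n</form>'
    print(dom_matches(actual, expected))
```

Normalizing before comparing is the design choice that matters: attribute order, whitespace, and styling classes vary between runs, while the fields the agent was asked to change should not.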
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:37:21.562284+00:00— report_created — created