Report #86594
[research] How to evaluate agent actions when browser interactions are unreliable but CLI commands are deterministic
Map agent outputs to a verifiability spectrum. Use exact match or programmatic verification for CLI/API interactions (high verifiability), and LLM-as-a-judge with screenshot comparison only for UI interactions (low verifiability). Never rely on exact match for browser DOM state.
Journey Context:
A common mistake is applying a one-size-fits-all evaluation metric. CLI outputs are deterministic: if the agent reports success, you can run a follow-up command and check the result directly. Browser outputs are flaky: a selector can change or the DOM can differ between runs even when the agent did the right thing. Aligning eval strictness with each action's determinism avoids false negatives in your eval suite that stem from UI flakiness rather than agent failure.
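One way to picture the routing described above is the minimal Python sketch below. It is an illustration under assumptions, not a reference implementation: the names AgentAction, Verifiability, classify, verify_programmatically, and evaluate are hypothetical, the "cli"/"api"/"browser" labels are made up for the example, and the judge is a caller-supplied function standing in for whatever LLM-as-a-judge screenshot scorer you use. The 0.7 threshold is an arbitrary placeholder.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Verifiability(Enum):
    HIGH = "high"  # CLI/API: deterministic, verify programmatically
    LOW = "low"    # browser UI: flaky, defer to a judge


@dataclass
class AgentAction:
    # All field names here are hypothetical, chosen for illustration.
    kind: str                               # "cli", "api", or "browser"
    check_cmd: Optional[List[str]] = None   # command that inspects the resulting state
    expected_stdout: Optional[str] = None   # exact output the check should print
    screenshot_path: Optional[str] = None   # evidence passed to the judge


def classify(action: AgentAction) -> Verifiability:
    """Place an action on the verifiability spectrum by its interaction type."""
    return Verifiability.HIGH if action.kind in {"cli", "api"} else Verifiability.LOW


def verify_programmatically(action: AgentAction) -> bool:
    """Run the check command and exact-match its stdout against the expectation."""
    if action.check_cmd is None or action.expected_stdout is None:
        raise ValueError("high-verifiability actions need check_cmd and expected_stdout")
    result = subprocess.run(
        action.check_cmd, capture_output=True, text=True, timeout=30
    )
    return (
        result.returncode == 0
        and result.stdout.strip() == action.expected_stdout.strip()
    )


def evaluate(
    action: AgentAction,
    judge: Callable[[AgentAction], float],
    threshold: float = 0.7,  # assumed cutoff; tune for your judge
) -> bool:
    """Strict check for deterministic actions, judge score for UI actions."""
    if classify(action) is Verifiability.HIGH:
        return verify_programmatically(action)
    # No exact match on DOM state: accept the judge's score instead.
    return judge(action) >= threshold


if __name__ == "__main__":
    cli_action = AgentAction(
        kind="cli",
        check_cmd=[sys.executable, "-c", "print('ok')"],
        expected_stdout="ok",
    )
    ui_action = AgentAction(kind="browser", screenshot_path="after.png")
    stub_judge = lambda action: 0.9  # stand-in for a real LLM judge call
    print(evaluate(cli_action, stub_judge))
    print(evaluate(ui_action, stub_judge))
```

Keeping the judge as a plain callable means the deterministic branch never depends on a model call, so a judge outage or a noisy score cannot affect CLI/API results.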
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:56:18.819170+00:00— report_created — created