Report #30212
[research] Agent evals failing unpredictably on browser/GUI actions but passing on CLI
Weight evals along the verifiability spectrum: use exact-match or other deterministic assertions for CLI/API tool calls, and rely on LLM-as-a-judge or state-snapshot heuristics for browser/DOM interactions.
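A minimal sketch of that split, for illustration only. The names `CliResult`, `grade_cli`, `grade_browser`, and `weighted_score` are hypothetical and not part of any real eval framework, and the snapshot is assumed to be a plain dict of final-state fields (URL, cart count, and so on) captured after the episode.

```python
from dataclasses import dataclass


@dataclass
class CliResult:
    """Hypothetical container for what a CLI tool call returned."""
    exit_code: int
    stdout: str


def grade_cli(result: CliResult, expected_stdout: str) -> float:
    """Deterministic assertion: exit code 0 and exact stdout match."""
    ok = result.exit_code == 0 and result.stdout.strip() == expected_stdout.strip()
    return 1.0 if ok else 0.0


def grade_browser(snapshot: dict, expected: dict) -> float:
    """State-snapshot heuristic: fraction of expected fields found in the
    final state. Fields not listed in `expected` (e.g. dynamic CSS classes)
    are ignored, so volatile DOM details cannot fail the check."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if snapshot.get(key) == value)
    return hits / len(expected)


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-surface scores; weights are an assumption the
    eval owner chooses, typically higher for the more verifiable surface."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


if __name__ == "__main__":
    cli = grade_cli(CliResult(0, "done\n"), "done")
    browser = grade_browser(
        {"url": "/cart", "cart_count": 2, "css_class": "x-91af"},
        {"url": "/cart", "cart_count": 2},
    )
    print(weighted_score({"cli": cli, "browser": browser}, {"cli": 2.0, "browser": 1.0}))
```

The browser grader returns partial credit rather than a hard pass/fail, which is one way to keep a single volatile field from turning into a false negative. An LLM-as-a-judge grader could be swapped in for `grade_browser` behind the same float-returning interface.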
Journey Context:
Agents interacting with CLIs get back structured stdout/stderr and exit codes, which makes assertions trivial. Browser agents act on a rendered DOM that is inherently non-deterministic (latency, dynamic classes, popups). Treating browser evals like CLI evals leads to flaky tests and false negatives. Evaluate the agent's decision-making separately from the environment's reliability, so that an environment flake is not scored as an agent error.
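One way to sketch that decoupling, under stated assumptions: the harness raises a dedicated exception (here the hypothetical `EnvFlake`) when the environment itself misbehaves, and the success check is polled until a deadline to absorb latency. `Outcome`, `poll_until`, and `classify` are illustrative names, not an existing API.

```python
import time
from enum import Enum
from typing import Callable


class Outcome(Enum):
    PASS = "pass"
    AGENT_FAILURE = "agent_failure"
    ENV_FAILURE = "env_failure"


class EnvFlake(Exception):
    """Hypothetical: raised by the harness for environment problems
    (page load timeout, unexpected popup), not for agent mistakes."""


def poll_until(check: Callable[[], bool], timeout_s: float = 5.0,
               interval_s: float = 0.1) -> bool:
    """Re-run `check` until it passes or the deadline expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def classify(run_episode: Callable[[], None], check: Callable[[], bool],
             retries: int = 2, **poll_kwargs) -> Outcome:
    """Retry on environment flakes; blame the agent only when the
    environment ran cleanly and the final state is still wrong."""
    for _ in range(retries + 1):
        try:
            run_episode()
        except EnvFlake:
            continue  # environment problem: retry instead of failing the agent
        passed = poll_until(check, **poll_kwargs)
        return Outcome.PASS if passed else Outcome.AGENT_FAILURE
    return Outcome.ENV_FAILURE
```

Reporting `ENV_FAILURE` as its own bucket keeps flaky runs out of the agent's failure rate while still making environment instability visible.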
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:05:56.584418+00:00— report_created — created