Report #29422
[research] Using identical evaluation strategies for CLI and browser-based agent tasks
Map each task to its place on the verifiability spectrum. CLI and file-system tasks should use exact-match or deterministic state assertions. Browser tasks should use tolerant visual/DOM assertions (e.g., Playwright assertions on text content) and accept a higher baseline flakiness.
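A minimal sketch of the two ends of that spectrum, using only the Python standard library. The function names (`check_cli_task`, `check_browser_task`, `normalize`) are hypothetical and not part of any framework. The browser check assumes the harness has already extracted the page's visible text as a string, for example through Playwright.

```python
import re
import subprocess
import sys
from pathlib import Path


def check_cli_task(command, expected_exit_code, output_file, expected_content):
    """Deterministic check: exact exit code and exact file content."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != expected_exit_code:
        return False
    path = Path(output_file)
    return path.exists() and path.read_text() == expected_content


def normalize(text):
    """Collapse whitespace and lowercase so layout noise does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def check_browser_task(page_text, required_phrases):
    """Tolerant check: every required phrase appears somewhere in the page text."""
    haystack = normalize(page_text)
    return all(normalize(phrase) in haystack for phrase in required_phrases)
```

The CLI check fails on any deviation, which is appropriate when the output is fully under the task's control. The browser check only asks whether the expected phrases are present, so unrelated page content such as banners or ads does not change the verdict.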
Journey Context:
CLI outputs are typically structured and deterministic; exit codes and file diffs are reliable signals. Browser environments are inherently non-deterministic (latency, dynamic DOM, ads). Treating browser evals like CLI evals (exact string match) produces a high false-negative rate: runs that actually succeeded get marked as failures. Lower the strictness for browser tasks and rely on visual or semantic equivalence.
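The false-negative problem can be illustrated with a small sketch. The helpers below are hypothetical, and the 0.8 similarity threshold and the 2-of-3 retry policy are illustrative assumptions rather than recommended values. `difflib.SequenceMatcher` stands in here for whatever semantic or visual comparison a real harness would use.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and lowercase before comparing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_match(expected, actual):
    """CLI-style strict comparison."""
    return expected == actual


def fuzzy_match(expected, actual, threshold=0.8):
    """Browser-style comparison: similarity ratio on normalized text."""
    matcher = difflib.SequenceMatcher(None, normalize(expected), normalize(actual))
    return matcher.ratio() >= threshold


def passes_with_retries(check, attempts=3, min_passes=2):
    """Run a flaky check several times; pass if enough attempts succeed."""
    passes = sum(1 for _ in range(attempts) if check())
    return passes >= min_passes


if __name__ == "__main__":
    expected = "Order confirmed"
    rendered = "  Order   Confirmed!\n"  # same outcome, different rendering
    print(exact_match(expected, rendered))  # False: a false negative
    print(fuzzy_match(expected, rendered))  # True
```

The retry helper is one way to encode "accept a higher baseline flakiness": a single slow page load fails one attempt but not the task, whereas a consistently wrong result still fails.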
Lifecycle
2026-06-18T03:46:42.863100+00:00 - report_created - created