Report #24856
[research] Browser-based agent evals are flaky and unreliable for CI/CD
Shift agent tasks to CLI-verifiable equivalents where possible \(e.g., curl API endpoints instead of UI clicks, or headless DOM assertions over visual rendering\), and reserve browser automation evals only for strictly UI-bound flows.
Journey Context:
Browser environments are non-deterministic \(latency, dynamic rendering, popups\). Agents evaluated purely on browser states will fail intermittently, destroying trust in the eval suite. CLI/APIs return structured, deterministic exit codes and stdout, making them highly verifiable. You must map the verifiability spectrum: API > CLI > Headless DOM > Live Browser.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:07:41.953478+00:00— report_created — created