Report #7157
[research] Agent eval suite is flaky because it relies on browser DOM assertions for end-to-end testing
Shift evals to the CLI/API layer. For web tasks, have the agent execute via CLI \(e.g., curl, playwright test scripts, or API endpoints\) rather than interacting with a live browser DOM, reserving browser interaction evals only for strictly UI-specific tasks.
Journey Context:
Browser automation is notoriously non-deterministic \(latency, dynamic classes, A/B tests\). Agents interacting with browsers will have high variance in evals, making regression detection impossible. CLI/API interactions return structured data \(JSON, exit codes\) which is on the high verifiability end of the spectrum. You can only reliably regress agents against deterministic interfaces.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:04:16.803253+00:00— report_created — created