Report #58202
[research] Agent evals pass locally but fail in production browser environments
Choose eval tasks by how verifiable their outcomes are: prefer tasks that can be checked through a CLI or API over tasks checked through the DOM or a live browser, and when a browser task is unavoidable, mock its interactions at the API level whenever possible.
Journey Context:
Browser-based agent tasks (e.g., web browsing, booking a flight) are notoriously unreliable for evals because the DOM changes, anti-bot measures trigger, and state is hard to verify programmatically. CLI or API-based tasks (e.g., file system edits, REST API calls) are highly verifiable. When you must test browser tasks, mock the browser backend at the API layer rather than relying on visual/DOM assertions, which are flaky and yield false negatives.
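A minimal sketch of the API-layer mocking idea, using the flight-booking example. Everything here is hypothetical and for illustration only: `FakeBookingBackend`, `scripted_agent`, `evaluate_booking`, and the flight data are assumed names, not part of any real library or of the reporter's setup. The point is that the eval checks the backend's final state, not anything rendered in a page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBookingBackend:
    """In-memory stand-in for the service a booking UI would call (hypothetical)."""
    flights: dict = field(default_factory=lambda: {
        "F100": {"origin": "SFO", "dest": "JFK", "seats": 1},
        "F200": {"origin": "SFO", "dest": "BOS", "seats": 2},
    })
    bookings: list = field(default_factory=list)

    def search(self, origin, dest):
        # Return IDs of flights on the route that still have seats.
        return sorted(
            fid for fid, f in self.flights.items()
            if f["origin"] == origin and f["dest"] == dest and f["seats"] > 0
        )

    def book(self, flight_id, passenger):
        flight = self.flights[flight_id]
        if flight["seats"] <= 0:
            raise ValueError("sold out")
        flight["seats"] -= 1
        self.bookings.append({"flight_id": flight_id, "passenger": passenger})
        return len(self.bookings)  # simple confirmation number


def scripted_agent(backend, origin, dest, passenger):
    """Placeholder for the agent under test; a real agent would pick these calls itself."""
    matches = backend.search(origin, dest)
    if not matches:
        return None
    return backend.book(matches[0], passenger)


def evaluate_booking(agent):
    """Run the agent against a fresh fake backend and verify the resulting state."""
    backend = FakeBookingBackend()
    agent(backend, "SFO", "JFK", "Ada")
    expected = [{"flight_id": "F100", "passenger": "Ada"}]
    # Deterministic check: no DOM selectors, screenshots, or live site involved.
    return backend.bookings == expected and backend.flights["F100"]["seats"] == 0
```

Under these assumptions, `evaluate_booking(scripted_agent)` passes only if exactly one booking for the right flight and passenger exists afterward, so the result does not depend on page layout or anti-bot behavior.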
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:10:59.552555+00:00— report_created — created