Report #45420
[research] Agent evals are flaky because browser-based UI assertions are unreliable
Shift evals to the CLI/API layer where outputs are deterministic; only use browser/UI evals for final end-to-end smoke tests, not regression suites.
Journey Context:
Browser DOM structure changes frequently, which makes Playwright/Selenium assertions brittle when used for agent regression testing. CLI and API outputs are structured and far more stable. Evaluating the agent's tool calls and API responses directly isolates agent logic from UI flakiness and can substantially reduce spurious CI failures, i.e. runs that fail even though the agent behaved correctly.
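A minimal sketch of what evaluating tool calls directly could look like, assuming the agent's run can be captured as an ordered list of tool calls. The `ToolCall` trace format, the `VOLATILE_KEYS` set, and the tool names are hypothetical and not taken from any specific agent framework; adapt them to whatever your harness actually records.

```python
from dataclasses import dataclass, field
from typing import Any

# Argument keys that legitimately vary between runs (assumed names).
VOLATILE_KEYS = {"request_id", "timestamp"}


@dataclass
class ToolCall:
    """One recorded agent tool call (hypothetical trace format)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def stable_args(call: ToolCall) -> dict[str, Any]:
    """Return the call's arguments with run-specific keys removed."""
    return {k: v for k, v in call.args.items() if k not in VOLATILE_KEYS}


def diff_tool_calls(actual: list[ToolCall], expected: list[ToolCall]) -> list[str]:
    """Compare two traces call by call; an empty result means they match."""
    problems: list[str] = []
    if len(actual) != len(expected):
        problems.append(f"expected {len(expected)} calls, got {len(actual)}")
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a.name != e.name:
            problems.append(f"call {i}: expected tool {e.name!r}, got {a.name!r}")
        elif stable_args(a) != stable_args(e):
            problems.append(
                f"call {i} ({e.name}): args differ: "
                f"expected {stable_args(e)}, got {stable_args(a)}"
            )
    return problems


if __name__ == "__main__":
    expected = [
        ToolCall("search_orders", {"customer_id": 42}),
        ToolCall("refund_order", {"order_id": 7, "amount": 25}),
    ]
    actual = [
        ToolCall("search_orders", {"customer_id": 42, "request_id": "abc"}),
        ToolCall("refund_order", {"order_id": 7, "amount": 20}),
    ]
    for problem in diff_tool_calls(actual, expected):
        print(problem)
```

Because the comparison works on structured data rather than rendered markup, a failure points at a specific call and argument instead of a missing selector. Browser checks can then be kept for a small end-to-end smoke test.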
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:42:35.678978+00:00— report_created — created