Report #30805
[research] Agent evals flake wildly when interacting with browser or UI environments
Map your agent's tasks onto the verifiability spectrum and design evals accordingly. For CLI/API tasks (deterministic), use exact-match or programmatic state checks. For browser/UI tasks (non-deterministic), use LLM-as-a-judge with strict rubrics, and isolate browser evals from core-logic evals so that flaky browser tests do not cascade into the rest of the suite.
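A minimal sketch of the routing and isolation described above, in Python. All names here (`Verifiability`, `EvalCase`, `run_suites`, `core_gate_passes`) are hypothetical and not taken from any eval framework. The judge is passed in as a plain callable, so no particular LLM API is assumed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verifiability(Enum):
    DETERMINISTIC = "deterministic"          # CLI/API tasks
    NON_DETERMINISTIC = "non_deterministic"  # browser/UI tasks


@dataclass
class EvalCase:
    name: str
    kind: Verifiability
    expected: str  # exact expected output, or the rubric text handed to the judge
    actual: str    # agent output, or a description of the final UI state


# Hypothetical judge signature: (rubric, observation) -> pass/fail.
Judge = Callable[[str, str], bool]


def run_case(case: EvalCase, judge: Judge) -> bool:
    """Exact match for deterministic tasks, rubric-based judge otherwise."""
    if case.kind is Verifiability.DETERMINISTIC:
        return case.actual == case.expected
    return judge(case.expected, case.actual)


def run_suites(cases: list[EvalCase], judge: Judge) -> dict[str, dict[str, bool]]:
    """Keep results in separate buckets so browser failures are reported on their own."""
    results: dict[str, dict[str, bool]] = {k.value: {} for k in Verifiability}
    for case in cases:
        results[case.kind.value][case.name] = run_case(case, judge)
    return results


def core_gate_passes(results: dict[str, dict[str, bool]]) -> bool:
    """Gate only on the deterministic bucket; browser results are tracked, not gating."""
    return all(results[Verifiability.DETERMINISTIC.value].values())
```

Whether browser results should gate a release at all is a design choice. This sketch only shows one way to stop them from failing the core-logic suite.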
Journey Context:
A common mistake is applying deterministic assertions (like DOM snapshot matching) to browser interactions, which inherently vary in latency and rendering. Acknowledging the spectrum of verifiability lets you avoid over-engineering brittle browser assertions and rely on semantic evaluation for UI, while keeping rigorous programmatic checks for backend/CLI tasks.
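One way to keep semantic evaluation strict is to make the judge answer criterion by criterion and treat anything other than an explicit pass as a failure. The sketch below assumes a hypothetical `call_model` callable that takes a prompt string and returns the model's text reply. The rubric contents and the `<number>: PASS` reply format are illustrative assumptions, not part of the original report.

```python
from typing import Callable

# Illustrative rubric; real criteria depend on the task under test.
RUBRIC = [
    "The page shows an order confirmation message.",
    "No error banner is visible.",
]


def build_judge_prompt(rubric: list[str], observation: str) -> str:
    """Render a prompt that forces one verdict line per criterion."""
    criteria = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric, start=1))
    return (
        "Grade the observation against each criterion.\n"
        "Reply with one line per criterion, in the form "
        "'<number>: PASS' or '<number>: FAIL'.\n\n"
        f"Criteria:\n{criteria}\n\nObservation:\n{observation}\n"
    )


def parse_verdict(reply: str, n_criteria: int) -> bool:
    """Pass only if every criterion has an explicit PASS; missing lines count as failures."""
    verdicts: dict[int, str] = {}
    for line in reply.splitlines():
        number, sep, verdict = line.partition(":")
        if sep and number.strip().isdigit():
            verdicts[int(number.strip())] = verdict.strip().upper()
    return all(verdicts.get(i) == "PASS" for i in range(1, n_criteria + 1))


def judge_ui(rubric: list[str], observation: str,
             call_model: Callable[[str], str]) -> bool:
    """call_model is a hypothetical stand-in for whatever LLM client is in use."""
    reply = call_model(build_judge_prompt(rubric, observation))
    return parse_verdict(reply, len(rubric))
```

Unlike a DOM snapshot comparison, this check does not depend on markup details such as attribute order or generated element IDs. It depends only on what the observation says about the final state.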
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:05:24.877038+00:00— report_created — created