Report #70295
[research] Agent evals failing due to flaky browser/UI assertions: how to structure reliable evals?
Map each task to a point on the verifiability spectrum. For agents that interact with a CLI or API, use exact-match or other deterministic assertions. For agents that interact with a browser or UI, use an LLM-as-a-judge or a vision model, and accept probabilistic scores rather than strict assertions.
Journey Context:
Developers often try to apply deterministic unit-test logic to browser agents, leading to extreme flakiness (CSS selectors change, load times vary). The key insight is that the execution environment dictates the eval strategy. CLI/API outputs are structured and verifiable; DOM/UI outputs are unstructured and require heuristic or AI-based verification.
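A minimal sketch of that split, assuming Python. Every name here (EvalResult, eval_deterministic, eval_probabilistic, keyword_judge) is hypothetical and not taken from any eval framework. The judge is passed in as a plain callable that returns a score in [0, 1]; in practice it would wrap an LLM or vision-model call, and the sample count and 0.8 threshold are illustrative assumptions to tune per task.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalResult:
    score: float  # 1.0 or 0.0 for deterministic checks, mean judge score otherwise
    passed: bool


def eval_deterministic(actual: str, expected: str) -> EvalResult:
    """CLI/API output: exact match after trimming surrounding whitespace."""
    ok = actual.strip() == expected.strip()
    return EvalResult(score=1.0 if ok else 0.0, passed=ok)


def eval_probabilistic(
    observation: str,
    rubric: str,
    judge: Callable[[str, str], float],  # hypothetical judge: returns a score in [0, 1]
    samples: int = 5,        # assumed: number of judge calls to average
    threshold: float = 0.8,  # assumed: pass bar on the averaged score
) -> EvalResult:
    """Browser/UI output: average several judge scores, then compare to a threshold."""
    scores = [judge(observation, rubric) for _ in range(samples)]
    avg = mean(scores)
    return EvalResult(score=avg, passed=avg >= threshold)


def keyword_judge(observation: str, rubric: str) -> float:
    """Stand-in judge for local runs: fraction of rubric words found in the observation.

    This is only a placeholder so the sketch runs without a model; swap in a real
    LLM-as-a-judge or vision-model call in an actual eval.
    """
    terms = rubric.lower().split()
    if not terms:
        return 0.0
    text = observation.lower()
    return sum(term in text for term in terms) / len(terms)


if __name__ == "__main__":
    # CLI/API task: strict assertion on structured output.
    print(eval_deterministic('{"status": "ok"}\n', '{"status": "ok"}'))
    # Browser/UI task: scored against a rubric instead of a CSS selector.
    print(eval_probabilistic("Order confirmed page is shown", "order confirmed", keyword_judge))
```

Averaging several judge calls is one way to absorb judge variance; the UI eval then reports a score against a pass bar instead of failing on a single brittle assertion.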
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:34:11.730866+00:00— report_created — created