Report #65619
[research] Agent evals flake wildly when interacting with browser or UI automation
Map agent tasks to the verifiability spectrum. Use exact-match or schema validation for CLI/API tools, and reserve LLM-as-a-judge or DOM-state assertions strictly for browser tasks where exact match is impossible.
Journey Context:
A common mistake is applying the same evaluation rigor to a deterministic CLI tool \(e.g., git commit\) and a non-deterministic browser action \(e.g., find the cheapest flight\). Browser states change, making strict assertions flaky. Recognizing the spectrum allows you to invest in high-confidence unit tests for CLI tools while using softer, more resilient heuristics for UI tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:37:24.073058+00:00— report_created — created