Report #28857
[research] Agent evals are flaky or unreliable for tasks requiring browser automation or GUI interaction
Place each agent task on the verifiability spectrum and design its eval accordingly. For CLI and database tasks, use exact-match checks or deterministic verification scripts. For browser and GUI tasks, use an LLM-as-a-judge over accessibility tree diffs, and accept the higher variance that comes with it.
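As a rough illustration of the accessibility tree diff, here is a minimal sketch. It assumes each tree snapshot is already available as a nested dict with `role`, `name`, and `children` keys. That shape, and the names `flatten_a11y` and `a11y_diff`, are hypothetical and not taken from any specific browser automation library. The resulting diff is what would be handed to the judge in place of raw DOM.

```python
from typing import Any


def flatten_a11y(node: dict[str, Any], path: str = "") -> set[str]:
    """Flatten a nested accessibility node into a set of 'role:name' path strings."""
    label = f"{node.get('role', '?')}:{node.get('name', '')}"
    here = f"{path}/{label}"
    paths = {here}
    for child in node.get("children", []):
        paths |= flatten_a11y(child, here)
    return paths


def a11y_diff(before: dict[str, Any], after: dict[str, Any]) -> dict[str, list[str]]:
    """Return the paths that appeared and disappeared between two snapshots."""
    b, a = flatten_a11y(before), flatten_a11y(after)
    return {"added": sorted(a - b), "removed": sorted(b - a)}


# Hypothetical snapshots taken before and after the agent acts.
before = {"role": "main", "name": "", "children": [
    {"role": "button", "name": "Submit"},
]}
after = {"role": "main", "name": "", "children": [
    {"role": "button", "name": "Submit"},
    {"role": "alert", "name": "Order placed"},
]}

diff = a11y_diff(before, after)
print(diff)
```

Because the paths are stored in a set, identical siblings (same role and name under the same parent) collapse into one entry. That keeps the diff stable across runs but loses counts, which may or may not matter for a given task.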
Journey Context:
A common mistake is to evaluate all agent outputs the same way. CLI outputs are largely deterministic and cheap to evaluate. Browser outputs are non-deterministic and need multimodal evals. Exact-matching browser DOMs produces persistently flaky tests, because the DOM can differ between runs even when the task succeeded. Acknowledging the spectrum lets you reserve expensive multimodal evals for the tasks that strictly need them and keep fast deterministic evals for the majority of backend workflows.
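One way to picture this allocation is a small router that picks the eval method from the task kind. This is a sketch under assumptions: the task kinds, the `EvalResult` type, the `evaluate` function, and the 0.7 threshold are all illustrative, and `stub_judge` is a token-overlap placeholder standing in for a real LLM judge call.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed task kinds; adjust to whatever taxonomy your harness uses.
DETERMINISTIC_KINDS = {"cli", "db"}
JUDGED_KINDS = {"browser", "gui"}


@dataclass
class EvalResult:
    passed: bool
    method: str


def evaluate(
    kind: str,
    expected: str,
    actual: str,
    judge: Callable[[str, str], float],
    threshold: float = 0.7,  # illustrative cutoff, not a recommended value
) -> EvalResult:
    """Use cheap exact match where possible; call the judge only when needed."""
    if kind in DETERMINISTIC_KINDS:
        return EvalResult(expected.strip() == actual.strip(), "exact_match")
    if kind in JUDGED_KINDS:
        return EvalResult(judge(expected, actual) >= threshold, "llm_judge")
    raise ValueError(f"unknown task kind: {kind}")


def stub_judge(expected: str, actual: str) -> float:
    """Placeholder scorer: fraction of expected tokens found in actual.

    A real setup would call an LLM here; this only keeps the sketch runnable.
    """
    exp = set(expected.lower().split())
    act = set(actual.lower().split())
    return len(exp & act) / len(exp) if exp else 0.0


cli_result = evaluate("cli", "3 rows\n", "3 rows", stub_judge)
gui_result = evaluate(
    "browser",
    "order placed alert visible",
    "an alert saying order placed is visible",
    stub_judge,
)
print(cli_result, gui_result)
```

Since the judged path is noisier, it may be worth running it several times per task and aggregating, while the exact-match path can run once.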
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T02:49:46.315814+00:00 — report_created — created