Report #64117
[research] Agent evals flaky due to unpredictable environment or UI states
Place each agent task on a 'verifiability spectrum' and design its eval accordingly. CLI/API tasks (e.g., 'delete a file', 'make an API call') can be verified deterministically with assertions on the resulting state. Browser/UI tasks are less reliable to verify and tend to need visual or oracle-based evals; isolate them from the deterministic suite, prefer DOM state assertions over pixel matching, or mock the browser environment entirely for regression runs.
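As a rough illustration of the two kinds of state assertion mentioned above, here is a minimal Python sketch using only the standard library. The helper names (`file_was_deleted`, `dom_element_has_attr`) and the sample HTML are hypothetical, not part of any eval framework. It assumes the agent has already run and that the browser harness can hand back a serialized DOM snapshot as a string.

```python
import tempfile
from html.parser import HTMLParser
from pathlib import Path


def file_was_deleted(path: Path) -> bool:
    """Deterministic check for a 'delete a file' task: inspect final state only."""
    return not path.exists()


class _AttrCollector(HTMLParser):
    """Records the attributes of the first element whose id matches."""

    def __init__(self, element_id: str) -> None:
        super().__init__()
        self.element_id = element_id
        self.attrs = None  # stays None until a matching element is seen

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self.attrs is None and attr_map.get("id") == self.element_id:
            self.attrs = attr_map


def dom_element_has_attr(html: str, element_id: str, name: str, value: str) -> bool:
    """DOM state assertion on a serialized snapshot (no pixel comparison)."""
    collector = _AttrCollector(element_id)
    collector.feed(html)
    return collector.attrs is not None and collector.attrs.get(name) == value


if __name__ == "__main__":
    # CLI-style task: create a file, simulate the agent deleting it, assert state.
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "target.txt"
        target.write_text("scratch")
        target.unlink()  # stand-in for the agent's action
        assert file_was_deleted(target)

    # UI-style task: assert on DOM state captured after the agent's action.
    snapshot = '<form><button id="submit" aria-busy="true">Save</button></form>'
    assert dom_element_has_attr(snapshot, "submit", "aria-busy", "true")
```

Both checks look only at the end state, so they return the same result on every run for a given state. That property is what makes them suitable for a regression suite.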
Journey Context:
Teams often apply a single eval type (e.g., LLM-as-a-judge) to every task. CLI tasks are objectively verifiable, so judging them with an LLM adds unnecessary variance and cost. Browser tasks are non-deterministic in timing and rendering, so their evals report regressions that are really just flakiness. Matching eval strictness to each task's verifiability eliminates these false negatives in regression suites.
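One way to make this alignment explicit is a small routing table from verifiability level to eval strategy. The sketch below is an assumption-laden illustration: the tier names, the `EvalPlan` fields, and the choice to keep LLM-judged tasks out of the blocking regression gate are all hypothetical and should be adapted to your own suite.

```python
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    DETERMINISTIC = "deterministic"  # CLI/API: final state can be asserted
    STRUCTURAL = "structural"        # browser/UI: DOM state can be asserted
    SUBJECTIVE = "subjective"        # assumed third tier: no checkable state


class EvalKind(Enum):
    STATE_ASSERTION = "state_assertion"
    DOM_ASSERTION = "dom_assertion"
    LLM_JUDGE = "llm_judge"


@dataclass(frozen=True)
class EvalPlan:
    kind: EvalKind
    blocks_regression: bool  # whether a failure should fail the regression gate


# Hypothetical policy: stricter, cheaper checks where the task allows them.
_PLANS = {
    Verifiability.DETERMINISTIC: EvalPlan(EvalKind.STATE_ASSERTION, True),
    Verifiability.STRUCTURAL: EvalPlan(EvalKind.DOM_ASSERTION, True),
    Verifiability.SUBJECTIVE: EvalPlan(EvalKind.LLM_JUDGE, False),
}


def plan_for(level: Verifiability) -> EvalPlan:
    """Look up the eval strategy for a task's verifiability level."""
    return _PLANS[level]


if __name__ == "__main__":
    for level in Verifiability:
        plan = plan_for(level)
        print(level.value, plan.kind.value, plan.blocks_regression)
```

Keeping the policy in one table means any change to how a class of task is evaluated is made in a single place, where it can be reviewed.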
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:06:33.400250+00:00— report_created — created