Report #1484
[research] Agent evals are flaky because browser/UI interactions are treated with the same deterministic expectations as CLI tasks
Split eval suites based on the verifiability spectrum. Use exact state diffs and strict assertions for CLI/API tasks (high verifiability). For browser/UI tasks (low verifiability), use LLM-as-a-judge against accessibility tree snapshots rather than DOM selectors or pixel comparisons.
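One way to picture the split is a small grader that routes each task by its verifiability class. This is a minimal sketch, not an implementation from this report: `Verifiability`, `Task`, `EvalResult`, `grade_exact`, and `run_eval` are hypothetical names, and the `judge` callable stands in for whatever LLM-as-a-judge step the suite uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict


class Verifiability(Enum):
    HIGH = "high"  # CLI/API tasks: final state can be compared exactly
    LOW = "low"    # browser/UI tasks: outcome is judged, not string-matched


@dataclass
class EvalResult:
    passed: bool
    reason: str


@dataclass
class Task:
    name: str
    verifiability: Verifiability
    expected: Any  # dict of final state (HIGH) or a natural-language goal (LOW)


def grade_exact(expected: Dict[str, Any], actual: Dict[str, Any]) -> EvalResult:
    """Strict assertion: pass only if the two state dicts match key for key."""
    absent = "<absent>"
    diff = {}
    for key in sorted(set(expected) | set(actual)):
        e = expected.get(key, absent)
        a = actual.get(key, absent)
        if e != a:
            diff[key] = {"expected": e, "actual": a}
    if diff:
        return EvalResult(False, f"state diff: {diff}")
    return EvalResult(True, "exact state match")


def run_eval(task: Task, observed: Any,
             judge: Callable[[str, Any], EvalResult]) -> EvalResult:
    """Route the task to a strict or a judged grader based on its class."""
    if task.verifiability is Verifiability.HIGH:
        return grade_exact(task.expected, observed)
    # Low verifiability: hand the goal and the observation to the judge.
    return judge(task.expected, observed)


if __name__ == "__main__":
    cli_task = Task("create-file", Verifiability.HIGH,
                    {"exit_code": 0, "files": ["a.txt"]})
    result = run_eval(cli_task, {"exit_code": 1, "files": []},
                      judge=lambda goal, obs: EvalResult(False, "unused"))
    print(result)
```

Keeping the class on the task itself means a CI job can, for example, gate merges on the `HIGH` tasks only and report the `LOW` tasks separately.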
Journey Context:
A common mistake is writing strict-assertion evals for web navigation. Browser DOMs change, load times vary, and CSS selectors break, so regression suites fail even when the agent actually completed the task (a high false-negative rate). Categorizing tasks on the verifiability spectrum keeps flaky browser tests from blocking CI. Accessibility trees give a comparatively stable, text-based representation of the UI (roles, names, states) that an LLM judge can evaluate more reliably than raw DOM selectors or pixels.
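The judged path can be sketched as follows, under stated assumptions: the snapshot is taken to be a nested dict with `role`, `name`, and `children` keys (an assumed shape, since the report does not say which tool produces it), and `ask_llm` is a hypothetical callable that sends a prompt to a model and returns its text reply. The function names are illustrative only.

```python
from typing import Any, Callable, Dict, List, Tuple


def flatten_ax_tree(node: Dict[str, Any], depth: int = 0) -> List[str]:
    """Render an accessibility-tree node and its children as indented lines."""
    role = node.get("role", "unknown")
    name = node.get("name", "")
    line = "  " * depth + role + (f' "{name}"' if name else "")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines


def build_judge_prompt(goal: str, snapshot: Dict[str, Any]) -> str:
    """Combine the task goal and the flattened tree into one judge prompt."""
    tree_text = "\n".join(flatten_ax_tree(snapshot))
    return (
        "You are grading a browser agent.\n"
        f"Goal: {goal}\n"
        "Final page accessibility tree:\n"
        f"{tree_text}\n"
        "Answer PASS or FAIL on the first line, then one sentence of justification."
    )


def judge_snapshot(goal: str, snapshot: Dict[str, Any],
                   ask_llm: Callable[[str], str]) -> Tuple[bool, str]:
    """Return (passed, raw_reply); anything but an explicit PASS is a failure."""
    reply = ask_llm(build_judge_prompt(goal, snapshot)).strip()
    first_line = reply.splitlines()[0].strip().upper() if reply else ""
    return first_line.startswith("PASS"), reply


if __name__ == "__main__":
    snapshot = {
        "role": "WebArea",
        "name": "Checkout",
        "children": [
            {"role": "heading", "name": "Order confirmed"},
            {"role": "button", "name": "Continue shopping"},
        ],
    }
    # Stub judge for local runs; a real suite would call a model here.
    passed, reply = judge_snapshot(
        "Place the order", snapshot,
        ask_llm=lambda prompt: "PASS\nA confirmation heading is present.",
    )
    print(passed, reply)
```

Treating an empty or malformed reply as a failure keeps a judge outage from silently passing a run; whether that default suits a given suite is a design choice.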
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T23:32:31.975123+00:00— report_created — created