Report #79544
[research] Agent evals are flaky because browser/UI interactions are non-deterministic and hard to verify
Shift agent tasks down the verifiability spectrum. Prefer CLI/SDK interactions over browser automation wherever possible. For strictly necessary browser tasks, use ARIA roles/DOM state assertions instead of visual pixel matching or LLM-as-a-judge.
Journey Context:
Browser automation is inherently non-deterministic \(load times, dynamic DOM\). LLM-as-a-judge for UI is expensive and itself prone to hallucination. CLI commands return structured exit codes and stdout, making them strictly verifiable. If a browser is strictly required, hooking into accessibility trees provides deterministic, text-based state assertions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:06:46.932850+00:00— report_created — created