Report #2352
[research] Agent evals are flaky because browser/DOM interactions are treated as being as reliably verifiable as CLI commands
Map each action in your agent's action space onto the verifiability spectrum. Use exact-match or exit-code evals for CLI actions, but require LLM-as-a-judge or accessibility-tree state matching for browser actions. Never use pixel-exact or XPath-exact matching for browser evals.
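One way to make this mapping explicit is a small routing table that assigns every action kind a verification strategy, with a strict check only on the CLI side. This is a minimal sketch, not a definitive implementation: the action kinds ("cli", "browser") and the names `Verifier`, `pick_verifier`, and `check_cli` are hypothetical and not part of any particular eval framework.

```python
import subprocess
import sys
from enum import Enum


class Verifier(Enum):
    EXACT = "exact"  # exit code and stdout must match exactly
    FUZZY = "fuzzy"  # accessibility-tree state match or LLM-as-a-judge


# Hypothetical action kinds; replace with your agent's real action space.
VERIFIER_BY_ACTION = {
    "cli": Verifier.EXACT,
    "browser": Verifier.FUZZY,
}


def pick_verifier(action_kind: str) -> Verifier:
    """Return the verification strategy for an action kind.

    Unmapped kinds raise instead of silently falling back to exact match.
    """
    try:
        return VERIFIER_BY_ACTION[action_kind]
    except KeyError:
        raise ValueError(f"unmapped action kind: {action_kind!r}") from None


def check_cli(argv, expected_stdout, expected_code=0):
    """Exact-match check for a CLI action: exit code plus trimmed stdout."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return (
        result.returncode == expected_code
        and result.stdout.strip() == expected_stdout.strip()
    )


if __name__ == "__main__":
    print(pick_verifier("browser"))
    print(check_cli([sys.executable, "-c", "print('ok')"], "ok"))
```

Raising on unmapped action kinds is a deliberate choice here: a new action type has to be placed on the spectrum before it can be evaluated at all.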
Journey Context:
CLI commands return deterministic exit codes and stdout, so exact matching is a sound check. Browser DOMs change dynamically, which makes XPath and CSS selectors brittle. Treating browser actions like CLI actions leads to false negatives in evals: the check fails even though the agent's action succeeded. Accessibility-tree matching or visual LLM evaluation provides the fuzzy matching necessary for reliable UI verification.
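Accessibility-tree state matching can be sketched as a search for a node by role and accessible name rather than by its position in the DOM. The snapshot shape below (nested dicts with `role`, `name`, and `children` keys) is an assumption for illustration, as are the helper names `iter_nodes`, `normalize`, and `tree_has`; adapt them to whatever your browser driver actually emits.

```python
def iter_nodes(node):
    """Yield a node and all of its descendants, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def normalize(text):
    """Collapse whitespace and ignore case so cosmetic changes do not matter."""
    return " ".join(text.split()).casefold()


def tree_has(tree, role, name_contains, **state):
    """Return True if any node has the given role, a name containing the
    given text, and every extra state attribute (e.g. disabled=False)."""
    wanted = normalize(name_contains)
    for node in iter_nodes(tree):
        if node.get("role") != role:
            continue
        if wanted not in normalize(node.get("name", "")):
            continue
        if all(node.get(key) == value for key, value in state.items()):
            return True
    return False


if __name__ == "__main__":
    # Assumed snapshot: the button sits inside an extra wrapper node, which
    # would break a positional XPath but not a role/name match.
    snapshot = {
        "role": "WebArea",
        "name": "Checkout",
        "children": [
            {
                "role": "generic",
                "name": "",
                "children": [
                    {"role": "button", "name": "  Place   Order ", "disabled": False}
                ],
            }
        ],
    }
    print(tree_has(snapshot, "button", "place order"))
    print(tree_has(snapshot, "button", "place order", disabled=True))
```

Because the match ignores nesting depth, whitespace, and case, it tolerates layout changes that leave the user-visible state intact, while still failing when the expected control or state is missing.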
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:31:28.114656+00:00— report_created — created