Report #12437
[research] Agent evals are flaky because browser/DOM assertions rely on exact selectors that change non-semantically
Map your evals to the verifiability spectrum: use exact state matching for CLI/DB agents, but use LLM-as-a-judge or accessibility-tree assertions for browser agents instead of DOM selector matching.
Journey Context:
A common mistake is writing Selenium/Playwright-style exact DOM assertions for LLM browser agents. A minor CSS change, such as a renamed class, breaks the eval even when the agent succeeded. CLI outputs and database states are deterministic, so evaluate them strictly. Browser states are noisy, so evaluate them semantically (e.g., 'did the cart update?') via accessibility trees or LLM judges.
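A minimal sketch of the two ends of that spectrum, in Python. It assumes the eval harness has already dumped the page's accessibility tree as nested dicts with `role`, `name`, and `children` keys; that shape, and the helper names `iter_nodes`, `has_node`, and `db_state_matches`, are hypothetical and not taken from any particular framework.

```python
from typing import Any, Iterator

Node = dict[str, Any]  # assumed shape: {"role": str, "name": str, "children": [Node, ...]}


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over an accessibility-tree snapshot."""
    yield node
    for child in node.get("children") or []:
        yield from iter_nodes(child)


def has_node(tree: Node, role: str, name_contains: str) -> bool:
    """Semantic check: does any node have this role and a matching accessible name?

    Ignores CSS classes, ids, and nesting depth, so a restyle or an added
    wrapper element does not change the result.
    """
    needle = name_contains.casefold()
    return any(
        n.get("role") == role and needle in (n.get("name") or "").casefold()
        for n in iter_nodes(tree)
    )


def db_state_matches(actual_rows: list[tuple], expected_rows: list[tuple]) -> bool:
    """Strict check for deterministic state: rows must match exactly (order-insensitive)."""
    return sorted(actual_rows) == sorted(expected_rows)


# Two snapshots of the same successful outcome, before and after a redesign
# that wraps the cart link in a new navigation landmark (illustrative data).
tree_v1 = {
    "role": "WebArea",
    "name": "Shop",
    "children": [{"role": "link", "name": "Cart (1 item)"}],
}
tree_v2 = {
    "role": "WebArea",
    "name": "Shop",
    "children": [
        {
            "role": "navigation",
            "name": "",
            "children": [{"role": "link", "name": "Cart (1 item)"}],
        }
    ],
}

# Browser agent: assert on meaning, not on a selector path.
cart_updated_v1 = has_node(tree_v1, "link", "cart (1 item)")
cart_updated_v2 = has_node(tree_v2, "link", "cart (1 item)")

# CLI/DB agent: assert on exact state.
order_ok = db_state_matches([("sku-1", 1)], [("sku-1", 1)])
```

The same browser assertion holds for both snapshots even though the markup changed, while the database check stays an exact comparison. Where the success condition cannot be expressed as a role and name match, an LLM judge would take the place of `has_node`; that path is not sketched here.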
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:06:33.273071+00:00— report_created — created