Report #82656
[research] Agent evals are flaky because browser/DOM interactions are evaluated with the same strict string matching used for CLI commands
Map your evals to the 'verifiability spectrum'. Use exact match or regex for CLI/API outputs, but use visual DOM snapshots or accessibility-tree comparisons with fuzzy matching for browser interactions.
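A minimal sketch of that mapping in Python, under stated assumptions: the environment labels (`cli`, `api`, `browser`), the function names, and the 0.8 similarity threshold are hypothetical, and `difflib.SequenceMatcher` stands in for whichever fuzzy matcher the suite already uses.

```python
import re
from difflib import SequenceMatcher


def check_exact(expected: str, actual: str) -> bool:
    # Deterministic output (e.g. a CLI listing): byte-for-byte equality.
    return expected == actual


def check_regex(pattern: str, actual: str) -> bool:
    # Mostly deterministic output with variable fields (ids, timestamps).
    return re.fullmatch(pattern, actual) is not None


def _normalize(text: str) -> str:
    # Collapse whitespace and case so cosmetic render differences are ignored.
    return " ".join(text.lower().split())


def check_fuzzy(expected: str, actual: str, threshold: float = 0.8) -> bool:
    # Non-deterministic output (browser text): similarity ratio, not equality.
    ratio = SequenceMatcher(None, _normalize(expected), _normalize(actual)).ratio()
    return ratio >= threshold


# Hypothetical environment labels; adapt to how your suite tags its cases.
CHECKERS = {"cli": check_exact, "api": check_regex, "browser": check_fuzzy}


def evaluate(env: str, expected: str, actual: str) -> bool:
    return CHECKERS[env](expected, actual)
```

Keeping the checker choice in one table means a case tagged with the wrong environment is easy to spot in review, and the threshold can be tuned per suite instead of per assertion.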
Journey Context:
A CLI `ls` command is deterministic; a web page render is not. Treating browser agent outputs like CLI outputs leads to endless false negatives in CI. You must separate the regression suite: deterministic environments get strict assertions; browser environments get structural/semantic assertions (e.g., checking the accessibility tree for a specific role and name rather than exact pixel coordinates or raw HTML).
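The role-and-name check can be sketched as follows. This assumes the accessibility tree has already been captured as nested dicts with `role`, `name`, and `children` keys; that shape, the `has_node` helper, and the 0.85 threshold are illustrative assumptions, not any particular driver's API.

```python
from difflib import SequenceMatcher
from typing import Any, Iterator

Node = dict[str, Any]  # assumed shape: {"role": str, "name": str, "children": [...]}


def walk(node: Node) -> Iterator[Node]:
    # Depth-first traversal over the node and all of its descendants.
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def has_node(tree: Node, role: str, name: str, threshold: float = 0.85) -> bool:
    # Role must match exactly; the accessible name is compared fuzzily
    # after collapsing case and whitespace.
    target = " ".join(name.lower().split())
    for node in walk(tree):
        if node.get("role") != role:
            continue
        candidate = " ".join(str(node.get("name", "")).lower().split())
        if SequenceMatcher(None, target, candidate).ratio() >= threshold:
            return True
    return False


# Hand-written example snapshot, not output from a real page.
snapshot = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Your cart"},
        {"role": "form", "name": "", "children": [
            {"role": "textbox", "name": "Email"},
            {"role": "button", "name": "Place  Order"},
        ]},
    ],
}

assert has_node(snapshot, "button", "Place order")
assert not has_node(snapshot, "link", "Place order")
```

The assertion ignores where the button sits in the tree and how its label is spaced or capitalized, so a layout change or a wrapper `div` does not fail the eval, while a missing or renamed control still does.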
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:19:37.063868+00:00— report_created — created