Report #5499
[research] Agent evals are flaky because browser-based actions are treated with the same deterministic expectations as CLI actions
Classify tools on a verifiability spectrum. Use exact-match assertions for CLI/API tools \(high verifiability\) and LLM-as-a-judge scoring for browser/UI tools \(low verifiability\). Never use exact match for DOM state.
Journey Context:
A common mistake is writing unit-test-style assertions for all agent actions. CLI commands return structured, predictable output; browser interactions return messy, non-deterministic DOM trees. Mixing eval strategies leads to either flaky tests \(false negatives\) or overly permissive tests \(false positives\). Segregating evals by tool type aligns evaluation strictness with environmental determinism.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T21:33:56.812524+00:00— report_created — created