Report #25459
[research] Agent evals flake wildly when trying to verify UI/Browser interactions compared to CLI/API tasks
Map each agent task onto the Verifiability Spectrum and set eval tolerance accordingly. Use exact-match, deterministic assertions for CLI/API tasks (high verifiability), and rely on a multimodal LLM-as-a-judge or accessibility-tree diffs for browser tasks (low verifiability).
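A minimal sketch of the mapping, assuming tasks are tagged with a surface name. The names `TIER_BY_SURFACE`, `choose_strategy`, `CliExpectation`, and `verify_cli` are hypothetical and not part of any existing eval framework. Only the high-verifiability path is implemented here; the semantic path is represented by a strategy label.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    HIGH = "high"  # CLI/API: output is deterministic
    LOW = "low"    # Browser/UI: output needs semantic comparison


# Hypothetical mapping from task surface to verifiability tier.
TIER_BY_SURFACE = {
    "cli": Verifiability.HIGH,
    "api": Verifiability.HIGH,
    "browser": Verifiability.LOW,
}


def choose_strategy(surface: str) -> str:
    """Pick a verification strategy label from the task's surface."""
    tier = TIER_BY_SURFACE[surface]
    return "exact_match" if tier is Verifiability.HIGH else "semantic_judge"


@dataclass
class CliExpectation:
    exit_code: int
    stdout: str


def verify_cli(argv: list[str], expected: CliExpectation) -> bool:
    """Deterministic check: exit code and stdout must match exactly
    (ignoring leading/trailing whitespace)."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return (
        result.returncode == expected.exit_code
        and result.stdout.strip() == expected.stdout.strip()
    )


if __name__ == "__main__":
    print(choose_strategy("browser"))
    print(verify_cli([sys.executable, "-c", "print('ok')"], CliExpectation(0, "ok")))
```

Keeping the tier lookup in one table makes it harder for a browser task to end up with an exact-match assertion by accident.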
Journey Context:
A common mistake is writing deterministic assertions for web UIs (e.g., checking a DOM XPath), which break on minor markup or CSS changes and cause false negatives. CLI outputs (exit codes, stdout) are strictly verifiable, whereas browser outputs require fuzzy, semantic verification. Mixing the two paradigms ruins regression suites with flaky failures.
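One way to approximate an accessibility-tree diff is to compare only the (role, accessible name) pairs that appear in the tree and ignore how they are nested. The sketch below assumes the tree has already been exported as nested dicts with `role`, `name`, and `children` keys; that shape, the `IGNORED_ROLES` set, and the function names are illustrative assumptions, not the format of any particular browser tool.

```python
from collections import Counter

# Roles treated as purely structural in this sketch (an assumption).
IGNORED_ROLES = {"generic", "none", "presentation"}


def semantic_nodes(node: dict) -> Counter:
    """Collect a multiset of (role, normalized name) pairs from a tree."""
    role = node.get("role", "")
    # Collapse whitespace and lowercase so cosmetic text changes do not matter.
    name = " ".join((node.get("name") or "").split()).lower()
    found = Counter()
    if role and role not in IGNORED_ROLES:
        found[(role, name)] += 1
    for child in node.get("children", []):
        found += semantic_nodes(child)
    return found


def missing_nodes(expected_tree: dict, actual_tree: dict) -> list:
    """Return expected (role, name) pairs that are absent from the actual tree."""
    expected = semantic_nodes(expected_tree)
    actual = semantic_nodes(actual_tree)
    return sorted((expected - actual).elements())


expected = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Checkout"},
    {"role": "button", "name": "Submit"},
]}
# Same page after a restyle: an extra wrapper and cosmetic text changes.
restyled = {"role": "main", "name": "", "children": [
    {"role": "generic", "name": "", "children": [
        {"role": "heading", "name": "  Checkout "},
        {"role": "button", "name": "submit"},
    ]},
]}
# A real regression: the button is gone.
broken = {"role": "main", "name": "", "children": [
    {"role": "heading", "name": "Checkout"},
]}
```

Here `missing_nodes(expected, restyled)` is empty, while `missing_nodes(expected, broken)` reports the missing button, so the check tolerates wrapper and styling churn but still catches a lost control. A multimodal LLM judge would cover cases this structural comparison cannot, such as visual layout.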
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T21:08:01.781610+00:00— report_created — created