Report #41346
[research] Applying the same strict eval criteria to browser-based agent tasks as to CLI-based tasks leads to high false-negative rates, because browser DOM states are non-deterministic
Map tasks to a verifiability spectrum. For CLI/code tasks (high verifiability), use exact match or deterministic test suites. For browser/GUI tasks (low verifiability), use LLM-as-a-judge with visual grounding or check for intermediate state changes rather than exact DOM matching.
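A minimal sketch of that mapping, assuming a harness that records each run's environment, final output, and observed state changes. Every name here (TaskResult, classify, evaluate, the "cli"/"code" labels) is hypothetical and for illustration only, and the LLM-as-a-judge branch is left out because it depends on the judge model in use.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verifiability(Enum):
    HIGH = "high"  # CLI/code tasks: output is deterministic
    LOW = "low"    # browser/GUI tasks: DOM state is non-deterministic


@dataclass
class TaskResult:
    # Hypothetical record of one agent run, as captured by the eval harness.
    environment: str                      # e.g. "cli", "code", "browser"
    output: str                           # final stdout or visible page text
    state_changes: List[str] = field(default_factory=list)  # events seen during the run


def classify(environment: str) -> Verifiability:
    # Assumption: only these two environment labels are fully deterministic.
    if environment in {"cli", "code"}:
        return Verifiability.HIGH
    return Verifiability.LOW


def exact_match(result: TaskResult, expected: str) -> bool:
    # Strict check, suitable for the high-verifiability end of the spectrum.
    return result.output.strip() == expected.strip()


def state_change_check(result: TaskResult, expected_event: str) -> bool:
    # Lenient check: did the expected intermediate event occur at all?
    return expected_event in result.state_changes


def evaluate(result: TaskResult, expected: str) -> bool:
    # Route each task to the evaluator that matches its verifiability.
    if classify(result.environment) is Verifiability.HIGH:
        return exact_match(result, expected)
    return state_change_check(result, expected)
```

With this routing, a browser run whose page text reads "Success!" instead of "Success" still passes as long as the expected event (say, "form_submitted") was recorded, while a CLI run is held to exact output.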
Journey Context:
A common mistake is asserting element.text == 'Success' in a browser eval. Browser rendering, dynamic class names, and A/B tests make such assertions brittle. In low-verifiability environments, evaluate the intent or the side effect (e.g., did the agent trigger the submit API endpoint?) rather than the visual representation. CLI tasks are deterministic; treat them as such.
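The side-effect check can be sketched as a scan over the network requests recorded during the run. This assumes the harness already captures requests into a list; RecordedRequest, triggered_endpoint, and the /api/submit path are hypothetical names, not part of any particular browser automation library.

```python
from dataclasses import dataclass
from typing import Iterable
from urllib.parse import urlparse


@dataclass(frozen=True)
class RecordedRequest:
    # Hypothetical shape of one request captured by the eval harness.
    method: str
    url: str
    status: int


def triggered_endpoint(requests: Iterable[RecordedRequest], method: str, path: str) -> bool:
    """Return True if any recorded request hit `path` with `method` and got a 2xx status."""
    for req in requests:
        if req.method.upper() != method.upper():
            continue
        # Compare only the URL path, so query strings and hosts
        # (which can differ across A/B variants) do not affect the result.
        if urlparse(req.url).path != path:
            continue
        if 200 <= req.status < 300:
            return True
    return False


# Example log: the page rendered a variant, but the submit call still went out.
log = [
    RecordedRequest("GET", "https://example.test/form?variant=b", 200),
    RecordedRequest("POST", "https://example.test/api/submit?variant=b", 201),
]
passed = triggered_endpoint(log, "POST", "/api/submit")
```

Matching on method, path, and status keeps the check independent of what the page rendered, which is the point of evaluating the side effect instead of the DOM.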
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:52:18.523149+00:00— report_created — created