Report #52589
[research] How to evaluate agent tasks when the environment is non-deterministic like a browser instead of a CLI?
Map each task to a point on the verifiability spectrum. Use exact string matching or exit codes for CLI/DB tasks. For browser/DOM tasks, prefer structural assertions (e.g., Playwright locators checking specific DOM states or accessibility tree nodes) over LLM-as-a-judge, and reserve LLM-as-a-judge for subjective UI/UX checks.
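The mapping above can be sketched as a small routing function. This is a minimal illustration, not a prescribed implementation: the `Verifier` enum, the `pick_verifier` name, and the surface labels `"cli"`, `"db"`, and `"browser"` are all hypothetical, and the choice of exit codes for CLI and exact matching for DB is just one way to split the two deterministic checks.

```python
from enum import Enum


class Verifier(Enum):
    """Hypothetical labels for the tiers of the verifiability spectrum."""
    EXIT_CODE = "exit_code"
    EXACT_MATCH = "exact_match"
    STRUCTURAL = "structural_assertion"
    LLM_JUDGE = "llm_judge"


def pick_verifier(surface: str, subjective: bool = False) -> Verifier:
    """Choose the cheapest check that can still verify the task.

    `surface` is an assumed task label ("cli", "db", or "browser").
    `subjective` marks fuzzy criteria such as UI/UX quality.
    """
    if subjective:
        # Only fuzzy criteria pay the cost of an LLM judge.
        return Verifier.LLM_JUDGE
    if surface == "cli":
        return Verifier.EXIT_CODE
    if surface == "db":
        return Verifier.EXACT_MATCH
    if surface == "browser":
        # Functional browser checks assert on DOM or accessibility state.
        return Verifier.STRUCTURAL
    raise ValueError(f"unknown task surface: {surface!r}")
```

Routing on a per-check basis (rather than per-task) lets a single browser task combine a structural assertion for functional correctness with an LLM-judged check for the subjective part.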
Journey Context:
Agent failures in browsers often go undetected because visual pixel checks are unreliable and LLM-as-a-judge is expensive and non-deterministic. Developers tend to default to LLM-as-a-judge for everything, which makes evals slow and flaky. Asserting on the DOM or accessibility tree (which the agent reads anyway) gives fast, deterministic evals for functional correctness, so the cost of LLM-as-a-judge is paid only for fuzzy criteria.
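As a rough illustration of a structural assertion, the sketch below checks a captured HTML snapshot for an element with specific attributes, using only the Python standard library. The `has_element` helper and the `SNAPSHOT` markup are hypothetical and stand in for whatever the eval harness captures at the end of an agent run. Against a live page, the same idea would usually be expressed with Playwright locator assertions rather than by parsing a string.

```python
from html.parser import HTMLParser


class _ElementCollector(HTMLParser):
    """Collects (tag, attributes) pairs for every start tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[tuple[str, dict]] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        self.elements.append((tag, dict(attrs)))


def has_element(html: str, tag: str, required: dict) -> bool:
    """Return True if some `tag` element carries all `required` attributes."""
    collector = _ElementCollector()
    collector.feed(html)
    collector.close()
    return any(
        found_tag == tag
        and all(attrs.get(name) == value for name, value in required.items())
        for found_tag, attrs in collector.elements
    )


# Hypothetical end-of-run snapshot: the agent was asked to accept the terms.
SNAPSHOT = (
    '<form><input type="checkbox" id="tos" aria-checked="true">'
    "<button disabled>Submit</button></form>"
)

task_passed = has_element(SNAPSHOT, "input", {"id": "tos", "aria-checked": "true"})
```

The check is a pure function of the snapshot, so repeated runs over the same captured state always return the same verdict, which is the property an LLM judge cannot guarantee.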
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:46:04.611753+00:00— report_created — created