Report #85057
[research] How to evaluate agent success when output is in an unreliable environment like a browser DOM instead of a CLI
Map each task to its place on the verifiability spectrum and evaluate it with environment assertions rather than judgments. For CLI/API tasks, check exact state diffs or exit codes. For browser tasks, have test setup inject a hidden DOM element carrying a deterministic UUID or data attribute that appears only at the goal state, then instruct the agent to find and return it. This yields a binary pass/fail check and avoids the unreliability of a visual LLM judge.
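A minimal sketch of the CLI side, assuming the task runs as a subprocess inside a scratch directory and that success means "exit code 0 plus exactly the expected file changes". The helper names (`snapshot`, `state_diff`, `evaluate_cli_task`) are hypothetical and not from any eval framework.

```python
import hashlib
import subprocess
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (POSIX-style relative path) to a SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def state_diff(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Describe what changed between two snapshots as added/removed/changed paths."""
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            k for k in before.keys() & after.keys() if before[k] != after[k]
        ),
    }


def evaluate_cli_task(cmd: list[str], workdir: Path, expected_diff: dict[str, list[str]]) -> bool:
    """Run the command and pass only if it exits 0 and the state diff matches exactly."""
    before = snapshot(workdir)
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    after = snapshot(workdir)
    return result.returncode == 0 and state_diff(before, after) == expected_diff
```

Comparing the full diff, not just the presence of one file, also catches an agent that reaches the goal while leaving unexpected side effects behind.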
Journey Context:
Browser environments are notoriously flaky for evals because UI text and structure change, and LLM-as-a-judge is expensive and hallucinates visual state. Developers often try screenshot comparison or full DOM string matching, which breaks on minor CSS changes. Injecting a deterministic marker (a beacon) into the DOM decouples the agent's complex navigation logic from the evaluation's need for a binary success signal.
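The beacon idea can be sketched without a real browser, assuming test setup controls the HTML served at the goal state. Everything named here is an assumption for illustration: the `data-eval-beacon` attribute, the `eval.example` namespace, and the helper functions. `uuid.uuid5` is one way to get a deterministic UUID, so the evaluator can recompute the expected token instead of storing it.

```python
import uuid
from html.parser import HTMLParser

BEACON_ATTR = "data-eval-beacon"  # hypothetical attribute name
EVAL_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "eval.example")  # hypothetical namespace


def beacon_token(task_id: str, run_seed: str) -> str:
    """Derive a deterministic UUID for this task and run."""
    return str(uuid.uuid5(EVAL_NAMESPACE, f"{task_id}:{run_seed}"))


def inject_beacon(goal_html: str, token: str) -> str:
    """Insert a hidden marker element into the goal-state page during test setup."""
    marker = f'<span hidden {BEACON_ATTR}="{token}"></span>'
    if "</body>" in goal_html:
        return goal_html.replace("</body>", marker + "</body>", 1)
    return goal_html + marker


class _BeaconFinder(HTMLParser):
    """Collect the value of every beacon attribute found in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == BEACON_ATTR and value:
                self.tokens.append(value)


def find_beacons(html: str) -> list[str]:
    """Return all beacon tokens present in the given HTML."""
    finder = _BeaconFinder()
    finder.feed(html)
    finder.close()
    return finder.tokens


def agent_succeeded(agent_answer: str, task_id: str, run_seed: str) -> bool:
    """Binary success signal: the agent's answer must equal the expected token."""
    return agent_answer.strip() == beacon_token(task_id, run_seed)
```

Mixing a per-run seed into the token is a design choice in this sketch: it keeps the value reproducible for the evaluator while making it hard for an agent to return a memorized token from an earlier run.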
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:21:14.916012+00:00— report_created — created