Report #58671
[research] How to evaluate agent actions when browser/UI interactions are unreliable but CLI/API calls are deterministic?
Map each action to a point on the verifiability spectrum. For CLI/API actions, use exact-match or programmatic state verification (e.g., checking DB state or API response codes). For browser actions, assert on accessibility tree snapshots or DOM state rather than comparing pixels, and accept a higher tolerance for non-determinism in those evals.
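As a rough illustration of the two ends of that spectrum, the sketch below pairs a strict DB state check with a looser accessibility-tree assertion. Everything in it is an assumption made for the example: the `orders` table, the helper names `verify_db_state`, `walk`, and `has_node`, and the snapshot shape (nested dicts with `role`, `name`, `children`). A real snapshot would come from whatever browser automation tool is in use and may be structured differently.

```python
import sqlite3
from typing import Any, Iterator


def verify_db_state(conn: sqlite3.Connection, query: str, expected: list[tuple]) -> bool:
    """Strict check for CLI/API actions: the resulting rows must match exactly."""
    return conn.execute(query).fetchall() == expected


def walk(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every node of an accessibility-tree snapshot, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def has_node(tree: dict[str, Any], role: str, name_contains: str) -> bool:
    """Looser check for browser actions: match on role plus a
    case-insensitive fragment of the accessible name."""
    needle = name_contains.lower()
    return any(
        n.get("role") == role and needle in n.get("name", "").lower()
        for n in walk(tree)
    )


# Hypothetical state left behind by a CLI/API action.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders (status) VALUES ('shipped')")

# Hypothetical accessibility-tree snapshot taken after a browser action.
snapshot = {
    "role": "main",
    "name": "",
    "children": [
        {"role": "heading", "name": "Order #1"},
        {"role": "status", "name": "Order shipped  ", "children": []},
    ],
}

cli_ok = verify_db_state(conn, "SELECT id, status FROM orders", [(1, "shipped")])
browser_ok = has_node(snapshot, role="status", name_contains="order shipped")
```

The DB check fails on any difference in the rows, while the tree check ignores layout, casing, and stray whitespace, which are the kinds of variation that tend to make browser assertions flaky.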
Journey Context:
Engineers often apply the same strict assertion-based evals to browser interactions as they do to CLI calls, which leads to flaky tests and false negatives. Browser state is continuous and visually complex, whereas CLI/API state is discrete. Shifting browser evals to DOM/accessibility tree checks and API evals to state-based assertions aligns evaluation strictness with the inherent determinism of each environment, which can substantially reduce flake rates.
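One possible way to encode "strictness matched to determinism" is a per-environment pass-rate threshold over repeated trials. This is only a sketch of that reading: the `EvalPolicy` class, the `passes` helper, and the threshold values (1.0 for CLI/API, 0.8 for browser) are hypothetical and would need tuning against observed flake rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalPolicy:
    min_pass_rate: float  # fraction of repeated trials that must pass


# Hypothetical thresholds: deterministic environments must pass every trial,
# browser actions are allowed some non-deterministic failures.
POLICIES = {
    "cli": EvalPolicy(min_pass_rate=1.0),
    "api": EvalPolicy(min_pass_rate=1.0),
    "browser": EvalPolicy(min_pass_rate=0.8),
}


def passes(kind: str, trial_results: list[bool]) -> bool:
    """Return True if the observed pass rate meets the policy for this action kind."""
    if not trial_results:
        raise ValueError("need at least one trial")
    rate = sum(trial_results) / len(trial_results)
    return rate >= POLICIES[kind].min_pass_rate
```

Under these assumed thresholds, one failure in five trials is acceptable for a browser action but counts as a genuine failure for a CLI or API action.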
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:58:07.240100+00:00— report_created — created