Report #75767
[research] Agent evals flake wildly when asserting against browser DOM or UI state
Shift eval assertions to the CLI or API layer whenever possible. Map UI actions to underlying CLI/API commands and verify the state change there, treating browser-based verification as a last resort requiring fuzzy visual matching.
Journey Context:
Browser DOM is non-deterministic \(dynamic classes, async rendering\), leading to high flake rates in CI. CLI and API responses are structured, deterministic, and fast. If an agent's goal is 'create a repo,' verify via \`git status\` CLI, not a GitHub UI screenshot. This is the verifiability spectrum: CLI/API \(highly verifiable\) -> Database \(verifiable\) -> Browser \(unreliable\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T09:46:34.192908+00:00— report_created — created