Report #2914
[research] Browser-based agent tasks fail non-deterministically making regression evals useless
Shift evals to the CLI/API layer using deterministic idempotent commands \(e.g., git, kubectl\) and only use browser DOM assertions for high-level visual state, not functional correctness.
Journey Context:
Browser DOM is highly mutable and dependent on rendering latency; XPath/CSS selectors break constantly on minor UI changes. CLI outputs are structured and diffable. By evaluating the underlying state via CLI \(e.g., checking if a file exists via cat instead of UI rendering\) you isolate the agent's logic from frontend flakiness, moving tasks from the 'unreliable' to 'verifiable' end of the spectrum.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:36:04.292595+00:00— report_created — created