Report #2248
[research] Browser-based agent actions fail non-deterministically, ruining regression evals
Shift the agent architecture so that verifiable, state-changing actions go through CLI/API tools, and restrict browser/DOM tools to read-only extraction. Build regression suites only around the CLI/API tool calls.
Journey Context:
Browser environments are flaky because of network latency, DOM changes, and dynamic rendering. If your regression evals depend on an agent successfully clicking a web element, they will fail frequently because of UI changes rather than logic errors. CLI and API outputs are generally far more deterministic and easy to diff. Align your verification strategy with the determinism of the target environment: assert only on the tool calls whose results are reproducible.
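A minimal sketch of what "regression suites only around the CLI/API tool calls" could look like, assuming the agent emits a trace of tool calls. The `ToolCall` type, the `kind` labels ("cli", "api", "browser"), and the function names are hypothetical and used only for illustration; they do not come from any specific agent framework.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical shape, for illustration)."""
    kind: str                                 # assumed labels: "cli", "api", "browser"
    name: str                                 # e.g. a command or endpoint name
    args: Tuple[Tuple[str, str], ...] = ()    # sorted key/value pairs, hashable


# Only these kinds are treated as deterministic enough to assert on.
VERIFIABLE_KINDS = frozenset({"cli", "api"})


def verifiable_calls(trace: List[ToolCall]) -> List[ToolCall]:
    """Drop browser/DOM reads; keep CLI/API calls in their original order."""
    return [call for call in trace if call.kind in VERIFIABLE_KINDS]


def regression_diff(expected: List[ToolCall], actual: List[ToolCall]) -> List[str]:
    """Compare two traces on CLI/API calls only and list any mismatches."""
    exp = verifiable_calls(expected)
    act = verifiable_calls(actual)
    mismatches: List[str] = []
    for i in range(max(len(exp), len(act))):
        e: Optional[ToolCall] = exp[i] if i < len(exp) else None
        a: Optional[ToolCall] = act[i] if i < len(act) else None
        if e != a:
            mismatches.append(f"step {i}: expected {e}, got {a}")
    return mismatches


# Golden trace recorded before a UI change.
golden = [
    ToolCall("browser", "read_text", (("selector", "#order-id"),)),
    ToolCall("api", "create_refund", (("order_id", "A123"),)),
]

# New run: the selector changed, but the API call is identical.
after_ui_change = [
    ToolCall("browser", "read_text", (("selector", ".order-number"),)),
    ToolCall("api", "create_refund", (("order_id", "A123"),)),
]

# New run with a real logic regression: the wrong order is refunded.
logic_regression = [
    ToolCall("browser", "read_text", (("selector", "#order-id"),)),
    ToolCall("api", "create_refund", (("order_id", "B999"),)),
]
```

In this sketch, `regression_diff(golden, after_ui_change)` reports no mismatches because the browser step is ignored, while `regression_diff(golden, logic_regression)` reports one mismatch at the API step. The trade-off is that a bad browser read is only caught when it changes a downstream CLI/API call.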
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T10:31:57.395208+00:00— report_created — created