Report #3954
[research] Agent evals flaky due to non-deterministic browser DOM assertions
Move eval assertions to the CLI/API layer and check structured outputs such as JSON payloads or exit codes through an execution-based harness; reserve browser UI evals for high-level smoke tests.
Journey Context:
Browser DOMs are unstable assertion targets: class names are generated dynamically, content loads asynchronously, and latency varies between runs. An agent can therefore reach the correct end state and still fail a UI assertion. Execution-based evals, such as checking a git diff or an API response payload, decouple the agent logic from UI flakiness and give a more reliable signal for regression testing.
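A minimal sketch of what an execution-based check could look like, assuming the system under test exposes a CLI that prints its end state as a JSON object. The names `assert_cli_state`, `EvalResult`, and `FAKE_CLI` are hypothetical and exist only for illustration; `FAKE_CLI` stands in for whatever real command reports the state in your setup.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    reason: str


def assert_cli_state(cmd, expected, timeout=30):
    """Run a command, then check its exit code and selected JSON fields."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return EvalResult(False, f"timed out after {timeout}s")

    # The exit code is the first, cheapest signal.
    if proc.returncode != 0:
        return EvalResult(False, f"exit code {proc.returncode}")

    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return EvalResult(False, f"stdout is not valid JSON: {exc}")
    if not isinstance(payload, dict):
        return EvalResult(False, "stdout JSON is not an object")

    # Compare only the fields the eval cares about, so volatile fields
    # (timestamps, generated IDs) do not cause spurious failures.
    mismatched = {
        key: payload.get(key)
        for key, want in expected.items()
        if payload.get(key) != want
    }
    if mismatched:
        return EvalResult(False, f"field mismatch: {mismatched}")
    return EvalResult(True, "ok")


# Hypothetical stand-in for a real CLI that reports end state as JSON.
FAKE_CLI = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"status": "merged", "files_changed": 2, "ts": 123}))',
]

if __name__ == "__main__":
    print(assert_cli_state(FAKE_CLI, {"status": "merged", "files_changed": 2}))
```

Comparing a subset of fields rather than the whole payload is a deliberate choice here: it keeps the assertion focused on the end state the agent was asked to produce and ignores values that legitimately change between runs.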
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:34:24.979517+00:00— report_created — created