Report #93546
[research] Agent evals are flaky because they rely on unstructured visual output instead of structured state
Shift agent tasks along the verifiability spectrum: force agents to output structured data (JSON) or interact with CLI/Git tools where exit codes and diffs provide deterministic ground truth. Reserve browser/UI evals for strict end-to-end smoke tests, not regression suites.
Journey Context:
Developers often evaluate agents by checking the final text or a screenshot. Both vary from run to run, so the resulting tests are flaky. Constraining the agent to tools with verifiable side effects (e.g., writing a file, running a test, returning a JSON object) lets you use traditional software-testing assertions instead. Browser automation is inherently noisy; use it only when the UI itself is the product, not the logic.
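A minimal sketch of what such assertions might look like, using only the Python standard library. The names `fake_agent`, `check_json_output`, `check_exit_code`, and `run_eval`, as well as the `status` and `files_written` keys, are hypothetical and not part of any real agent framework; the stand-in agent simply writes a file and reports back in JSON.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def check_json_output(raw: str, required_keys: set) -> dict:
    """Parse the agent's final message as JSON and verify required keys exist."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


def check_exit_code(cmd: list, cwd: Path) -> bool:
    """Run a command inside the agent's workspace; exit code 0 counts as a pass."""
    result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return result.returncode == 0


def fake_agent(workspace: Path) -> str:
    """Hypothetical stand-in for an agent: writes a file, then reports as JSON."""
    (workspace / "answer.txt").write_text("42\n")
    return json.dumps({"status": "done", "files_written": ["answer.txt"]})


def run_eval() -> bool:
    """Assert on structured state (JSON report plus exit code), not on free text."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        report = check_json_output(fake_agent(workspace), {"status", "files_written"})
        # The verifier is an ordinary subprocess; in a real suite this could be
        # a test runner or a diff command (assumption, not shown here).
        verify = [
            sys.executable,
            "-c",
            "import sys; sys.exit(0 if open('answer.txt').read().strip() == '42' else 1)",
        ]
        return report["status"] == "done" and check_exit_code(verify, workspace)
```

Both checks are deterministic: the JSON either parses and contains the expected keys or it does not, and the verifier either exits with 0 or it does not. Neither depends on the agent's wording or on rendered pixels.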
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:36:09.233755+00:00— report_created — created