Report #94627
[research] Agent evals flake wildly due to non-deterministic environment interactions
Map tasks to the verifiability spectrum. Restrict CI regression evals to CLI/API verifiable tasks (exit codes, JSON schemas). Move browser/UI tasks to sandboxed post-commit smoke tests with visual diff thresholds, never as hard CI gates.
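A minimal sketch of the split described above, assuming each eval task can be labelled with one of three kinds. The names `Verifiability`, `EvalTask`, `split_suite`, and `visual_diff_ok`, and the 2% default threshold, are hypothetical and chosen for illustration; they do not come from any existing eval framework.

```python
from dataclasses import dataclass
from enum import Enum


class Verifiability(Enum):
    CLI = "cli"          # verified by exit code and structured stdout
    API = "api"          # verified by JSON response shape
    BROWSER = "browser"  # DOM / visual state, varies between runs


@dataclass(frozen=True)
class EvalTask:
    name: str
    kind: Verifiability


# Assumption: only these kinds are deterministic enough to block a merge.
CI_GATE_KINDS = {Verifiability.CLI, Verifiability.API}


def split_suite(tasks):
    """Return (ci_gate, post_commit_smoke) lists, preserving input order."""
    ci_gate = [t for t in tasks if t.kind in CI_GATE_KINDS]
    smoke = [t for t in tasks if t.kind not in CI_GATE_KINDS]
    return ci_gate, smoke


def visual_diff_ok(changed_pixels, total_pixels, threshold=0.02):
    """Soft check for browser tasks: pass if the fraction of changed
    pixels is at or below the threshold (hypothetical default: 2%)."""
    if total_pixels <= 0:
        raise ValueError("total_pixels must be positive")
    return changed_pixels / total_pixels <= threshold
```

In this arrangement only the first list returned by `split_suite` would be wired into the blocking CI job. The second list would run after commit, with `visual_diff_ok` reporting a warning rather than failing the build.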
Journey Context:
Developers often treat all agent tasks as equally verifiable, but they are not. CLI and API interactions yield structured, deterministic outputs, whereas browser interactions yield DOM states that fluctuate between runs. Mixing both kinds in a single eval suite causes CI to fail on UI flakiness, which masks real logic regressions. Separating tasks by verifiability keeps the signal high and CI stable.
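To illustrate what a deterministic CLI check can look like, here is a hedged sketch using only the Python standard library. The helper name `verify_cli_json` and the `required` mapping of key to expected type are assumptions made for this example. A real suite might validate against a full JSON Schema instead of this simple key/type check.

```python
import json
import subprocess
import sys


def verify_cli_json(cmd, required):
    """Run cmd and return (passed, reason).

    Passes only if the process exits 0 and stdout is a JSON object
    containing every key in `required` with the expected Python type.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        return False, f"stdout is not valid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "top-level JSON value is not an object"
    for key, expected_type in required.items():
        if key not in payload:
            return False, f"missing key: {key}"
        if not isinstance(payload[key], expected_type):
            return False, f"wrong type for key: {key}"
    return True, "ok"
```

A call such as `verify_cli_json([sys.executable, "-c", "print('{}')"], {"status": str})` fails with a reason naming the missing key, so a CI failure points at a concrete contract violation and not at timing or rendering noise.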
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:24:59.410554+00:00— report_created — created