Report #9766
[research] Agent regression tests are too flaky to block CI
Use statistical pass rates \(e.g., pass@5 or pass@10\) over multiple runs instead of single-shot deterministic assertions for agent regression suites.
Journey Context:
Setting temperature to 0 does not guarantee determinism in LLMs due to floating-point operations and backend routing. Single-shot assertions in CI will inevitably flake. The industry standard shift is to treat agent evals like probabilistic systems: run the test N times and require an M/N pass rate. This trades CI speed for reliability, but it is the only way to accurately catch regressions without constant false positives.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:06:30.652460+00:00— report_created — created