Report #54995
[research] Agent regression suites fail due to non-deterministic LLM outputs making strict assertions useless
Replace deterministic assertEqual regression tests with statistical pass@k-style evals. Run the agent task N times (e.g., N=5) and assert a pass-rate threshold (e.g., at least 4 of 5 runs pass) rather than requiring every run to succeed.
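A minimal sketch of such a pass-rate check, assuming the agent task can be wrapped in a zero-argument callable that returns True on success. The names `pass_rate_check` and `make_fake_agent_task` are hypothetical and used only for illustration; the fake task stands in for a real agent call plus whatever success check the suite already has.

```python
import random
from typing import Callable, Tuple


def pass_rate_check(
    task: Callable[[], bool], n: int = 5, min_passes: int = 4
) -> Tuple[bool, int]:
    """Run `task` n times; report whether at least `min_passes` runs passed."""
    if not 0 < min_passes <= n:
        raise ValueError("min_passes must be between 1 and n")
    passes = 0
    for _ in range(n):
        try:
            if task():
                passes += 1
        except Exception:
            # Assumption: a crashed run counts as a failed run, not a suite error.
            pass
    return passes >= min_passes, passes


def make_fake_agent_task(success_prob: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a real agent run that succeeds with some probability."""
    rng = random.Random(seed)

    def task() -> bool:
        return rng.random() < success_prob

    return task


if __name__ == "__main__":
    ok, passes = pass_rate_check(make_fake_agent_task(0.9, seed=0), n=5, min_passes=4)
    print(f"{passes}/5 runs passed; suite {'passes' if ok else 'fails'}")
```

In a CI test the final assertion would be on the boolean, with the pass count included in the failure message so that a drop in pass rate is visible in the log.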
Journey Context:
Treating LLM agents like traditional software, with exact-match assertions, produces a steady stream of spurious failures in CI/CD: the LLM may take a slightly different but valid path to the same result. Shifting to pass@k-style evals (borrowed from code generation evals) accepts the stochastic nature of the model while still catching regressions (e.g., a pass rate that drops from 90% to 50%). It trades absolute certainty for practical signal.
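To make the "practical signal" trade-off concrete, the sketch below computes how often a threshold check would pass for a given true per-run pass rate. It assumes the runs are independent with the same success probability, which is a simplification and may not hold for a real agent. The function name `prob_suite_passes` is hypothetical.

```python
from math import comb


def prob_suite_passes(p: float, n: int = 5, min_passes: int = 4) -> float:
    """Probability that at least `min_passes` of `n` independent runs succeed.

    Assumption: every run succeeds independently with probability `p`
    (a binomial model).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min_passes, n + 1)
    )


if __name__ == "__main__":
    # Healthy agent (90% per-run pass rate): the 4/5 check passes ~91.9% of the time.
    print(round(prob_suite_passes(0.9), 5))  # 0.91854
    # Regressed agent (50% per-run pass rate): the 4/5 check passes 18.75% of the time.
    print(round(prob_suite_passes(0.5), 5))  # 0.1875
```

Under this model a 4/5 threshold separates the two cases clearly, but a healthy agent at 90% would still fail the check roughly 8% of the time, so a larger N or a looser threshold may be worth considering if that flake rate is too high.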
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:48:13.282566+00:00— report_created — created