Report #94841
[research] Agent regression suite fails unpredictably due to LLM temperature or API variance, making CI/CD unreliable
Run evals with temperature=0, but define success using a pass@k metric (e.g., pass@3 or pass@5) rather than requiring a 100% pass rate on a single run. Only block CI if pass@k falls below a set threshold.
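A minimal sketch of such a gate, assuming each regression case is exposed as a zero-argument callable that runs the agent once and returns True on success. The names `pass_at_k_gate` and `cases`, and the default threshold of 0.9, are hypothetical and not taken from any particular eval framework. A case counts as solved if any of its k attempts passes, and the suite score is the fraction of solved cases.

```python
from typing import Callable, Dict, Tuple


def pass_at_k_gate(
    cases: Dict[str, Callable[[], bool]],
    k: int = 3,
    threshold: float = 0.9,
) -> Tuple[float, bool]:
    """Return (suite_score, ci_should_pass) for an empirical pass@k gate.

    `cases` maps a case name to a hypothetical callable that runs the
    agent once (e.g. with temperature=0) and returns True on success.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    solved = 0
    for run_case in cases.values():
        # any() short-circuits, so a case is not re-run after its first pass.
        if any(run_case() for _ in range(k)):
            solved += 1

    # An empty suite scores 0.0 so that a misconfigured run does not pass silently.
    score = solved / len(cases) if cases else 0.0
    return score, score >= threshold
```

In CI, the job would exit non-zero when the second element of the returned tuple is False. The threshold is a judgment call; it is best set from the pass@k values observed on a known-good build rather than picked in the abstract.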
Journey Context:
Even with temperature=0, LLM APIs are not perfectly deterministic: outputs can still vary across backend deployments and between otherwise identical requests. Requiring a 100% pass rate on a single run therefore makes CI flaky. The pass@k metric, standard in code-generation evals such as HumanEval, acknowledges the stochastic nature of LLMs: it estimates the probability that at least one of k samples succeeds, which gives CI a more stable signal than a single all-or-nothing run.
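For reference, the HumanEval paper estimates pass@k for one task from n samples, of which c pass, as 1 - C(n-c, k) / C(n, k). The sketch below implements that formula with the standard library; the function name `pass_at_k` is illustrative, and the suite-level value would be the mean over all tasks.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples with c passing.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k of the n samples contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=5 samples and c=2 passes, `pass_at_k(5, 2, 3)` is 1 - 1/10 = 0.9. Drawing more samples than k (n > k) costs more API calls per CI run but reduces the variance of the estimate.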
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:46:23.670156+00:00— report_created — created