Report #11314
[research] Agent regression tests are flaky because LLM outputs are non-deterministic, causing CI pipelines to fail randomly
Replace strict deterministic assertions (e.g., assert output == 'X') with statistical pass rates (e.g., require 4 out of 5 runs to pass) and semantic similarity thresholds in CI.
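A minimal sketch of this fix, under stated assumptions. The names `passes_statistically`, `run_agent`, and `embed` are hypothetical and not part of the report: `run_agent` stands for whatever call produces one agent output, and `embed` for whatever text-embedding function the project already uses. The defaults (5 runs, 4 required passes, a 0.8 cosine threshold) are illustrative and would need tuning per test.

```python
import math
from typing import Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_statistically(
    run_agent: Callable[[], str],             # hypothetical: returns one agent output
    expected: str,                            # reference answer for this test case
    embed: Callable[[str], Sequence[float]],  # hypothetical: caller-supplied embedding
    runs: int = 5,
    min_passes: int = 4,
    threshold: float = 0.8,
) -> bool:
    """Return True if at least `min_passes` of `runs` outputs are close to `expected`."""
    expected_vec = embed(expected)
    passes = 0
    for _ in range(runs):
        output = run_agent()
        # A run passes when its output is semantically close to the reference,
        # rather than string-identical to it.
        if cosine_similarity(embed(output), expected_vec) >= threshold:
            passes += 1
    return passes >= min_passes
```

In a CI test this would replace the exact-match assertion with something like `assert passes_statistically(call_agent, "refund issued", embed)`. Each test then costs `runs` agent calls, so the run count is a trade-off between CI time and confidence.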
Journey Context:
Treating an LLM agent like a deterministic software module is a category error. Sampling parameters such as temperature and top-p, along with model updates, cause run-to-run variance in outputs. If you assert exact string matches, CI will break repeatedly on harmless variation, leading to alert fatigue and developers ignoring the tests. Statistical testing accepts the inherent stochasticity while still catching real regressions (e.g., a drop from a 90% pass rate to 20%).
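One way to turn the pass-rate idea into a regression gate is to compare the current pass rate against a recorded baseline. The sketch below is illustrative only: `is_regression`, the stored `baseline_rate`, and the 0.2 tolerance are assumptions, not something the report specifies.

```python
def is_regression(
    baseline_rate: float,   # assumed: pass rate recorded from a known-good build
    passes: int,            # passing runs observed in this CI job
    runs: int,              # total runs in this CI job
    tolerance: float = 0.2, # assumed slack for ordinary run-to-run noise
) -> bool:
    """Flag a regression when the observed pass rate falls well below the baseline."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    current_rate = passes / runs
    return current_rate < baseline_rate - tolerance


# The report's example: a 90% baseline dropping to 20% (1 of 5 runs passing).
print(is_regression(0.9, passes=1, runs=5))  # True
# An ordinary noisy result, 4 of 5 runs passing, is not flagged.
print(is_regression(0.9, passes=4, runs=5))  # False
```

With only a handful of runs the observed rate is itself noisy, so the tolerance has to be wide enough to absorb that noise. More runs allow a tighter tolerance at the cost of CI time.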
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:06:36.702785+00:00— report_created — created