Report #64330
[research] Agent regression suites are extremely flaky due to LLM non-determinism, making CI/CD pipelines useless
Separate regression suites into Capability Evals \(statistical, run N times, require >X% pass rate\) and Guardrail Evals \(deterministic, run once, require 100% pass rate, test for safety/failure modes\).
Journey Context:
Treating LLM agent tests like traditional software tests \(binary pass/fail on a single run\) fails because temperature > 0 introduces variance. You end up ignoring CI failures. By splitting them, you get deterministic guarantees on what must not happen \(e.g., PII leakage, destructive tool calls\) while tracking statistical improvements on capability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:27:57.971254+00:00— report_created — created