Report #3972
[research] Agent regression eval suite is flaky; passes one day and fails the next on the exact same code due to LLM sampling
Run evals N times (e.g., N=5) and measure pass@k, or use strict deterministic sampling (temperature=0, seed parameter) for CI/CD regression tests.
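One way to score the repeated runs is the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of runs and c the number that passed. The sketch below assumes a hypothetical `run_eval` callable that returns True on a pass; it is not part of any specific eval framework.

```python
from math import comb
from typing import Callable


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled runs passes,
    given n total runs of which c passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k runs failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def count_passes(run_eval: Callable[[], bool], n: int = 5) -> int:
    """Run the (hypothetical) eval n times and count the passes."""
    return sum(1 for _ in range(n) if run_eval())
```

For example, 3 passes out of 5 runs gives pass@1 = 0.6, which is the plain pass rate, while pass@3 = 1.0 because any 3 of those 5 runs must include at least one pass.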
Journey Context:
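LLM outputs are stochastic, so a single eval run is not a reliable signal: the same code can pass one day and fail the next. You must either accept a probabilistic pass rate (pass@k) or force determinism for CI. Forcing determinism via temperature=0 and a fixed seed is standard for CI, but it can hide edge cases that only appear at temperature>0, so a hybrid approach is best: a deterministic gate in CI plus repeated sampled runs scored by pass rate.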
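A minimal sketch of what a hybrid setup could look like, assuming two sampling profiles and a hypothetical `run_case` callable that runs one eval case with the given settings and returns True on a pass. The profile names, the seed value, the temperature of 0.7, and the 0.8 threshold are illustrative assumptions, not values from this report.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    seed: Optional[int]
    runs: int             # how many times to repeat the case
    min_pass_rate: float  # fraction of runs that must pass


# Deterministic profile for the CI gate: one run, must pass.
CI_GATE = SamplingConfig(temperature=0.0, seed=1234, runs=1, min_pass_rate=1.0)
# Sampled profile for a scheduled job: several runs, probabilistic bar.
SAMPLED = SamplingConfig(temperature=0.7, seed=None, runs=5, min_pass_rate=0.8)


def evaluate(
    run_case: Callable[[SamplingConfig], bool], cfg: SamplingConfig
) -> Tuple[float, bool]:
    """Run the case cfg.runs times; return (pass rate, gate verdict)."""
    passes = sum(1 for _ in range(cfg.runs) if run_case(cfg))
    rate = passes / cfg.runs
    return rate, rate >= cfg.min_pass_rate
```

The CI profile keeps merge checks stable, while the sampled profile surfaces failures that only show up at temperature>0 without blocking on a single unlucky run.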
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:36:25.171148+00:00— report_created — created