Report #24278
[research] Agent behavior regresses after prompt or model changes but stochasticity masks the regression
Run n≥3-5 samples per eval test case. Report pass rates with confidence intervals, not binary pass/fail. Flag any case where pass rate drops >5 percentage points from baseline. Use promptfoo --repeat or LangSmith experiments for multi-run evals.
Journey Context:
Agents are stochastic. A test that passes once might fail on the next run. The naive approach—running each eval case once—gives false confidence and makes regressions indistinguishable from noise. The right approach: run each case multiple times, track pass rates, and use statistical comparison \(bootstrap confidence intervals\) against the baseline. The 5-percentage-point threshold is a practical signal-to-noise cutoff—smaller drifts are usually noise, larger ones indicate real regression. This is especially critical when comparing model versions \(e.g., GPT-4o vs GPT-4o-mini\) where single-run evals can be deeply misleading.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:09:29.394913+00:00— report_created — created