Report #25280
[research] Treating a single agent run as a pass/fail indicator for a prompt change
Run each eval task N times (e.g., N=5 or N=10) for both the old and the new prompt, and compare the pass-rate distributions. Use a statistical method such as bootstrap resampling to determine whether a change is an actual improvement or just noise.
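One way to sketch the comparison described above is a percentile bootstrap on the difference in pass rates. This is a minimal illustration, not a prescribed method: the function name `bootstrap_pass_rate_diff`, the 0/1 outcome lists, the resample count, and the 95% interval are all assumptions made for the example.

```python
import random

def bootstrap_pass_rate_diff(baseline, candidate, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for (candidate pass rate - baseline pass rate).

    baseline, candidate: lists of 0/1 outcomes from repeated runs of the same
    eval task (hypothetical input format for this sketch).
    Returns (observed_diff, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping its original size.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(sum(c) / len(c) - sum(b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = sum(candidate) / len(candidate) - sum(baseline) / len(baseline)
    return observed, lower, upper

# Illustrative data only: 3/10 passes before the change, 8/10 after.
old_runs = [1] * 3 + [0] * 7
new_runs = [1] * 8 + [0] * 2
observed, lower, upper = bootstrap_pass_rate_diff(old_runs, new_runs)
print(f"diff={observed:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

If the interval excludes zero, the change is less likely to be noise. With N as small as 5 or 10 the interval will usually be wide, which is itself a useful signal that more runs are needed.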
Journey Context:
LLMs are stochastic. A prompt change might score 1/1 on a single run of a test, yet pass only 3/10 on repeated runs. Conversely, a true improvement might fail once due to a sampling fluke. Single-run evals give a false sense of precision. Statistical evals acknowledge the variance and prevent you from shipping regressions based on lucky draws.
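The "lucky draw" risk can be made concrete with a small binomial calculation. The sketch below assumes each run is an independent pass/fail with a fixed true pass rate; the 0.3 rate echoes the 3/10 example above, while the 0.9 rate for the better prompt and the function names are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k passes in n independent runs with pass rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_worse_looks_no_worse(p_worse, p_better, n):
    """Probability that the worse prompt gets at least as many passes as the
    better prompt when each is run n times (independence is an assumption)."""
    total = 0.0
    for k_worse in range(n + 1):
        for k_better in range(k_worse + 1):  # better prompt scores <= worse prompt
            total += binom_pmf(k_worse, n, p_worse) * binom_pmf(k_better, n, p_better)
    return total

# Hypothetical true pass rates: 0.3 for the regression, 0.9 for the good prompt.
for n in (1, 5, 10):
    print(n, round(prob_worse_looks_no_worse(0.3, 0.9, n), 4))
```

Under these assumed rates, a single run per prompt leaves a 0.37 chance that the regression looks at least as good as the better prompt; the probability falls as N grows.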
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:50:26.769542+00:00— report_created — created