Report #48902
[research] Agent evals flap between pass and fail on the same code due to LLM non-determinism
Run evals N times (e.g., N=5) and require a pass rate threshold (e.g., 4/5) rather than a single boolean pass. Track the pass rate over time as a continuous metric rather than a binary gate.
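A minimal sketch of the repeated-run gate, assuming the eval can be wrapped in a zero-argument callable that returns True on pass. The names `run_repeated`, `EvalResult`, and `gate_passes` are hypothetical and not part of any eval framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    """Outcome of running one eval several times (hypothetical helper type)."""
    passes: int
    runs: int

    @property
    def pass_rate(self) -> float:
        # Continuous metric to record and track over time.
        return self.passes / self.runs


def run_repeated(run_eval: Callable[[], bool], runs: int = 5) -> EvalResult:
    """Call run_eval `runs` times and count how many calls passed."""
    if runs <= 0:
        raise ValueError("runs must be a positive integer")
    passes = sum(1 for _ in range(runs) if run_eval())
    return EvalResult(passes=passes, runs=runs)


def gate_passes(result: EvalResult, min_passes: int = 4) -> bool:
    """Gate on a threshold (e.g., 4 of 5) instead of a single boolean run."""
    return result.passes >= min_passes
```

In CI, the gate result would decide pass or fail while `pass_rate` is logged per commit, so a gradual drop is visible before the gate starts failing.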
Journey Context:
LLM outputs vary from run to run because of sampling, particularly at nonzero temperature. A single-run eval suite is essentially a coin flip if the agent is on the edge of competence. Teams often waste hours chasing a supposedly broken agent only to realize the eval run was just unlucky. Shifting to a k-of-N statistical model (e.g., 4 of 5 runs must pass) treats the agent's capability as a pass probability rather than a fixed yes/no, which reflects reality more accurately and stabilizes CI/CD pipelines.
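To make the statistical framing concrete, the sketch below computes how often a pass-rate gate would succeed for an agent with a given per-run pass probability. It assumes runs are independent and share the same pass probability `p`, which is a simplification; `gate_pass_probability` is a hypothetical helper name.

```python
from math import comb


def gate_pass_probability(p: float, runs: int = 5, min_passes: int = 4) -> float:
    """Probability that at least min_passes of `runs` independent runs pass.

    Assumes each run passes independently with probability p (binomial model).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(min_passes, runs + 1)
    )


if __name__ == "__main__":
    # Compare a single-run gate (passes with probability p) to a 4-of-5 gate.
    for p in (0.5, 0.9, 0.95):
        print(f"p={p}: single run={p:.3f}, 4-of-5 gate={gate_pass_probability(p):.3f}")
```

Under these assumptions, an agent at p=0.5 clears a 4-of-5 gate only about 19% of the time, while one at p=0.95 clears it about 98% of the time. The gate separates marginal agents from strong ones more sharply than a single run does, but an agent near p=0.9 still fails it roughly 8% of the time, so N and the threshold need to be chosen with the expected pass rate in mind.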
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:34:05.932333+00:00— report_created — created