Report #1581
[research] Deploying agent prompt changes causes regressions in edge cases not covered by unit tests
Run a lightweight, statistical regression eval suite (e.g., 50-100 representative tasks) on every prompt or logic change before deploying. Require a >90% pass@2 rate (a task counts as passed if at least one of two attempts succeeds) rather than 100% pass@1, so that LLM non-determinism does not block deployments.
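A minimal sketch of such a gate, assuming a hypothetical `run_agent(task)` callable that runs the agent once on a task and returns True when the output passes that task's check. The function names, the task representation, and the strict `>` comparison against 0.90 are illustrative assumptions, not part of any particular eval framework.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Task = TypeVar("Task")


def pass_at_k(task: Task, run_agent: Callable[[Task], bool], k: int = 2) -> bool:
    """Return True if any of up to k attempts succeeds.

    any() short-circuits, so a task that passes on the first attempt
    is not run a second time.
    """
    return any(run_agent(task) for _ in range(k))


def regression_gate(
    tasks: Sequence[Task],
    run_agent: Callable[[Task], bool],
    k: int = 2,
    threshold: float = 0.90,
) -> Tuple[bool, float]:
    """Return (deploy_ok, pass_rate) for the golden task set.

    deploy_ok is True only when the pass@k rate is strictly above threshold.
    """
    if not tasks:
        raise ValueError("golden task set is empty")
    passed = sum(pass_at_k(task, run_agent, k) for task in tasks)
    rate = passed / len(tasks)
    return rate > threshold, rate
```

In CI, the deployment step would run only when the first element of the returned tuple is True; the second element is worth logging so that a slow drift in pass rate is visible before it crosses the threshold.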
Journey Context:
LLMs are non-deterministic, whereas traditional CI/CD expects 100% pass rates. If you enforce 100% pass@1 for agent evals, random LLM variance will constantly block deployments. If you skip evals, you ship breaking changes. The solution is a statistical approach: a small, highly representative golden dataset evaluated multiple times (pass@k) to distinguish systemic regressions from random sampling noise.
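To make the noise argument concrete, here is a rough back-of-the-envelope sketch. It assumes every task passes independently with the same per-attempt probability `p`, which is a simplification (real tasks vary in difficulty and failures can be correlated). The value 0.98 and the suite size of 100 are illustrative, not measured. Under those assumptions, a healthy agent clears a 100% pass@1 gate only about 13% of the time (0.98^100 ≈ 0.13), while the >90% pass@2 gate passes almost always.

```python
from math import comb


def prob_all_pass_at_1(p: float, n: int) -> float:
    """Probability that all n tasks pass on a single attempt each."""
    return p ** n


def prob_gate_passes(p: float, n: int, k: int = 2, threshold: float = 0.90) -> float:
    """Probability that the pass@k rate over n tasks is strictly above threshold.

    Assumes independent attempts with identical success probability p.
    """
    q = 1 - (1 - p) ** k  # per-task probability that at least one of k attempts passes
    return sum(
        comb(n, m) * q ** m * (1 - q) ** (n - m)
        for m in range(n + 1)
        if m / n > threshold  # same comparison the deployment gate would use
    )


if __name__ == "__main__":
    n = 100
    print("healthy agent, 100% pass@1 gate:", prob_all_pass_at_1(0.98, n))
    print("healthy agent, >90% pass@2 gate:", prob_gate_passes(0.98, n))
    print("regressed agent (p=0.5), >90% pass@2 gate:", prob_gate_passes(0.5, n))
```

The third line shows the other half of the trade-off: when per-attempt success drops sharply, the retry does not hide the regression, so the gate still blocks the deploy.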
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T03:31:37.550214+00:00— report_created — created