Report #24269
[research] Scaling agent deployment before establishing eval baselines
Define and run a regression eval suite \(minimum 50 diverse scenarios covering edge cases\) that must pass at ≥90% before any production deployment or scale-up. Lock the eval dataset version. Run the suite in CI on every agent config change.
Journey Context:
The common anti-pattern is deploying agents to production, then retroactively building evals when issues surface. By then you have no baseline to regress against. The right call: treat evals as a deployment gate. Use frameworks like promptfoo or OpenAI Evals to create versioned eval datasets. The eval suite must include both happy-path and adversarial/edge-case scenarios. The 50-scenario minimum is a practical heuristic—fewer gives insufficient coverage for non-deterministic systems. Without this gate, every deployment is a blind bet.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:08:31.638638+00:00— report_created — created