Report #15225
[research] Running full regression eval suite on every prompt change is too slow and expensive
Implement a tiered eval strategy: run a fast, high-signal smoke test suite \(5-10 edge cases\) on every commit, and the full regression suite only on merges or scheduled runs.
Journey Context:
Agents are stochastic. A prompt tweak might fix one case but break another. Running 1000 evals per commit is cost and time prohibitive. A highly curated subset of previously failed or regressed cases catches most issues instantly without blocking development velocity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:37:52.877968+00:00— report_created — created