Report #1447
[research] Agent behavior regresses after prompt changes, model updates, or dependency bumps — no CI quality gate for agent outputs
Maintain a versioned eval dataset (input → expected behavior pairs) in your repo alongside agent code. Run evals on every PR using LLM-as-judge or exact match. Block merges on regression (score decline > threshold on any eval category). Structure the dataset in three tiers: (1) smoke tests: fast, exact-match, must-pass (<30s total); (2) core regression: previously-failed-and-fixed cases that must not regress (<5min); (3) extended quality: broader coverage including edge cases, run nightly. Add a new eval case for every bug report before fixing the bug.
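The merge gate described above can be sketched as a per-category comparison against a stored baseline. This is a minimal illustration, not a prescribed implementation: the function names (`find_regressions`, `gate_exit_code`), the 0.02 threshold, and the idea of keeping baseline scores as a category-to-score mapping are all assumptions made for the example.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold: float = 0.02,  # assumed value; tune per project
) -> List[str]:
    """Return one message per eval category whose score dropped by more than threshold."""
    regressions = []
    for category, base_score in sorted(baseline.items()):
        cur_score = current.get(category)
        if cur_score is None:
            # A category that vanished from the run is treated as a regression.
            regressions.append(f"{category}: missing from current run")
        elif base_score - cur_score > threshold:
            regressions.append(f"{category}: {base_score:.2f} -> {cur_score:.2f}")
    return regressions


def gate_exit_code(regressions: List[str]) -> int:
    """Non-zero exit code makes the CI job fail, which blocks the merge."""
    return 1 if regressions else 0


if __name__ == "__main__":
    # Hypothetical scores; in practice these would come from the eval run.
    baseline_scores = {"smoke": 1.0, "core": 0.9}
    current_scores = {"smoke": 1.0, "core": 0.8}
    found = find_regressions(baseline_scores, current_scores)
    for message in found:
        print("REGRESSION", message)
    raise SystemExit(gate_exit_code(found))
```

Checking each category separately, rather than one aggregate score, keeps a gain in one category from masking a drop in another.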
Journey Context:
Agent behavior is fragile in ways traditional software isn't. A prompt rewording, a model version bump (even within the same model family), or a tool API response format change can silently break agent outputs. Teams that don't run evals in CI discover regressions in production and then can't bisect which change caused them. The eval dataset must live in the repo (not a separate system) so it versions with the code: a single PR (say, PR #123) changes the agent and its evals together. The three-tier structure is critical: if all evals take 20 minutes, developers will skip them. Smoke tests give fast feedback on obvious breakage; core regression catches the bugs you've already seen; extended quality catches new issues. The most important practice is adding an eval case for every bug report before fixing it. The new case should fail first, which confirms that it reproduces the issue, then pass once the fix lands, and it stays in the suite to prevent regression. Without this, you're constantly re-fixing the same classes of agent failures.
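As a rough sketch of the tiering and the bug-first workflow described above, the dataset can be modeled as tagged cases, with the CI trigger deciding which tiers run. Everything named here is hypothetical: the `EvalCase` fields, the `pr` and `nightly` trigger names, the sample cases, and the `stub_agent` stand-in for a real agent call. The example uses exact match only; an LLM-as-judge scorer would replace the equality check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Assumed mapping: PRs run the fast tiers, the nightly job runs everything.
TIERS_BY_TRIGGER = {
    "pr": ("smoke", "core"),
    "nightly": ("smoke", "core", "extended"),
}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    tier: str                      # "smoke", "core", or "extended"
    prompt: str
    expected: str
    bug_ref: Optional[str] = None  # set when the case was added for a bug report


def select_cases(cases: List[EvalCase], trigger: str) -> List[EvalCase]:
    """Keep only the cases whose tier runs for this CI trigger."""
    tiers = TIERS_BY_TRIGGER[trigger]
    return [case for case in cases if case.tier in tiers]


def run_exact_match(
    cases: List[EvalCase], agent: Callable[[str], str]
) -> Dict[str, bool]:
    """Map each case id to whether the agent output exactly matches the expectation."""
    return {case.case_id: agent(case.prompt).strip() == case.expected for case in cases}


# Hypothetical dataset; the second case was added for a bug report before the fix.
CASES = [
    EvalCase("smoke-greeting", "smoke", "Say hello", "hello"),
    EvalCase("core-refund-currency", "core", "Refund currency for order 1?", "USD",
             bug_ref="example-bug-1"),
    EvalCase("extended-empty-input", "extended", "", "Please provide a request."),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for the real agent; it does not yet handle the refund case."""
    return "hello" if prompt == "Say hello" else "unknown"


if __name__ == "__main__":
    print(run_exact_match(select_cases(CASES, "pr"), stub_agent))
```

With the stub agent, the bug-derived case fails, which is the intended state before the fix is written: the case demonstrates the bug first and then guards against its return.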
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-14T22:32:00.525762+00:00— report_created — created