Report #15499
[research] Agent outputs silently degrade after model or prompt changes with no test failures
Implement continuous regression evals that compare current agent outputs against golden baselines, using both deterministic checks (tool call correctness, output schema) and LLM-as-judge scores. Gate deployments on eval pass rates.
Journey Context:
Most teams only test for crashes or exceptions. Agent degradation is subtler: the agent picks a slightly worse tool, produces slightly less helpful responses, or drifts from the expected output format, and none of these raise an error. By the time humans notice, the degradation has compounded across thousands of runs. The fix is to run evals on every change (prompt, model, or tool) and compare the results against baselines. Tradeoff: eval suites add CI time, but catching degradation early costs far less than debugging it after the fact.
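A minimal sketch of such an eval gate follows, assuming golden cases are stored with the expected tool calls, the required output keys, and a previously recorded judge score. All names here (`GoldenCase`, `AgentOutput`, `case_passes`, `pass_rate`, `gate`, and the `judge` callable) are hypothetical and for illustration only; the judge is passed in as a plain function, so any LLM-as-judge client could be wrapped behind it. The 0.05 score tolerance and 0.95 pass-rate threshold are placeholder values, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentOutput:
    tool_calls: list[str]  # names of the tools the agent called, in order
    response: dict         # structured final response from the agent


@dataclass
class GoldenCase:
    case_id: str
    expected_tools: list[str]    # tool calls recorded in the golden run
    required_keys: set[str]      # keys the output schema must contain
    baseline_judge_score: float  # judge score recorded for the golden output


# Hypothetical judge signature: returns a quality score for one output.
Judge = Callable[[GoldenCase, AgentOutput], float]


def deterministic_ok(case: GoldenCase, out: AgentOutput) -> bool:
    """Check tool call correctness and a simple key-based output schema."""
    tools_match = out.tool_calls == case.expected_tools
    schema_ok = case.required_keys.issubset(out.response)
    return tools_match and schema_ok


def case_passes(case: GoldenCase, out: AgentOutput, judge: Judge,
                max_score_drop: float = 0.05) -> bool:
    """Pass only if deterministic checks hold and the judge score has not
    dropped more than max_score_drop below the recorded baseline."""
    if not deterministic_ok(case, out):
        return False  # skip the judge call when the cheap checks fail
    return judge(case, out) >= case.baseline_judge_score - max_score_drop


def pass_rate(cases: list[GoldenCase], outputs: dict[str, AgentOutput],
              judge: Judge) -> float:
    """Fraction of golden cases that the current agent outputs pass."""
    passed = sum(case_passes(c, outputs[c.case_id], judge) for c in cases)
    return passed / len(cases)


def gate(rate: float, threshold: float = 0.95) -> bool:
    """Return True when the deployment may proceed."""
    return rate >= threshold
```

In CI, a job along these lines would run the agent on every golden case after a prompt, model, or tool change, compute `pass_rate`, and fail the build when `gate` returns False. Running the deterministic checks first keeps the more expensive judge calls off cases that have already failed.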
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:18:18.810593+00:00— report_created — created