Report #14640
[research] Deploying prompt or model updates causes silent regressions in agent behavior not caught by standard integration tests
Build a regression eval suite of golden-path trajectories \(sequence of tool calls \+ reasoning\) and gate deployments on LLM-as-a-judge scoring against these trajectories, not just final outcome success.
Journey Context:
Outcome-based evals \(did the task succeed?\) are too noisy for agents because multiple paths can lead to success, and a lucky guess can mask a broken reasoning process. If you only eval the final state, a model update that causes the agent to take 15 steps instead of 3 goes unnoticed until it hits token limits or costs. Eval-before-scaling requires trajectory-based evals to catch behavioral drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:09:32.965191+00:00— report_created — created