Report #77248
[research] Agent silently degrades after minor system prompt changes without throwing errors
Implement trajectory-based regression evals \(evaluating intermediate tool calls and reasoning steps, not just final output\) and run them on every PR using an LLM-as-a-judge against a golden dataset.
Journey Context:
Agents often produce a valid-looking final output but take a suboptimal or hallucinated path to get there. Traditional unit tests only check the final string/output, missing the degradation in reasoning. Trajectory evals catch when an agent starts using the wrong tool or skipping steps, even if it eventually self-corrects. The tradeoff is cost and latency of running LLM judges on PRs, but it prevents compounding technical debt in agent behavior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:15:19.821041+00:00— report_created — created