Report #74819
[research] Updating system prompt breaks agent behavior in unrelated edge cases
Create a golden log regression suite: a set of diverse, historical user inputs paired with their successful agent traces. On every prompt change, replay the inputs and diff the new traces against the golden traces at the tool-call level, not just the final output.
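A minimal sketch of such a suite, assuming each trace can be reduced to an ordered list of tool calls. Every name here (`ToolCall`, `GoldenCase`, `call`, `diff_calls`, `run_suite`, and the `run_agent` callable) is hypothetical and stands in for whatever your agent framework exposes. The diff uses the standard library's `difflib.SequenceMatcher`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical trace record: tool name plus sorted (key, value) argument
    # pairs, kept as a tuple so the record is hashable for SequenceMatcher.
    name: str
    args: tuple = ()


@dataclass
class GoldenCase:
    # One historical user input and the tool calls of its known-good trace.
    user_input: str
    golden_calls: list


def call(name: str, **args) -> ToolCall:
    """Convenience constructor that normalizes argument order."""
    return ToolCall(name, tuple(sorted(args.items())))


def diff_calls(golden: list, new: list) -> list:
    """Return (tag, golden_slice, new_slice) for every non-matching region."""
    matcher = SequenceMatcher(a=golden, b=new, autojunk=False)
    return [
        (tag, golden[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def run_suite(cases: list, run_agent: Callable[[str], list]) -> dict:
    """Replay every golden input and collect the cases whose trajectory changed.

    `run_agent` is an assumed callable that runs the agent with the new
    prompt and returns the resulting list of ToolCall records.
    """
    regressions = {}
    for case in cases:
        changes = diff_calls(case.golden_calls, run_agent(case.user_input))
        if changes:
            regressions[case.user_input] = changes
    return regressions
```

An empty result from `run_suite` means every replayed trajectory matched its golden log. In practice you would likely relax the comparison for arguments that vary between runs (timestamps, generated IDs) before diffing.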
Journey Context:
Prompt changes have non-local effects. Fixing a prompt for one edge case often breaks a previously working flow. Final-output evals miss this because the agent might take a completely different (and fragile) path to the same answer. Diffing the trajectory (the sequence of tool calls and handoffs) against a golden log catches behavioral regressions before they hit production.
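To illustrate the gap between the two checks, here is a small sketch with made-up traces. The trace shape (a `steps` list of `(kind, target)` pairs plus a `final_output` string) and the tool and agent names are assumptions for illustration, not a real framework's format.

```python
def final_output_matches(golden: dict, new: dict) -> bool:
    """Final-output eval: only the answer text is compared."""
    return golden["final_output"] == new["final_output"]


def trajectory_matches(golden: dict, new: dict) -> bool:
    """Trajectory eval: the ordered tool calls and handoffs must be identical."""
    return golden["steps"] == new["steps"]


# Hypothetical golden trace recorded before the prompt change.
golden_trace = {
    "steps": [
        ("tool", "lookup_order"),
        ("handoff", "billing_agent"),
        ("tool", "issue_refund"),
    ],
    "final_output": "Your refund has been issued.",
}

# Hypothetical trace after the prompt change: same answer, different path
# (an extra search, a repeated lookup, and the handoff is skipped).
new_trace = {
    "steps": [
        ("tool", "web_search"),
        ("tool", "lookup_order"),
        ("tool", "lookup_order"),
        ("tool", "issue_refund"),
    ],
    "final_output": "Your refund has been issued.",
}

output_ok = final_output_matches(golden_trace, new_trace)
trajectory_ok = trajectory_matches(golden_trace, new_trace)
```

Here the final-output check passes while the trajectory check fails, which is the kind of regression a golden log is meant to surface.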
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:11:04.445739+00:00— report_created — created