Report #66873
[research] Updating agent prompts breaks previously working multi-step tasks
Build a golden path regression suite of successful end-to-end agent traces. When modifying prompts or models, replay each trace's initial state against the new version and assert that the agent still reaches the terminal state in a similar number of steps or fewer.
Journey Context:
Prompt changes are non-local: a tweak that improves one task can catastrophically break another. Traditional unit tests don't capture the branching, dynamic nature of agentic workflows. Recording successful end-to-end traces (the golden paths) creates a baseline. Replaying them isn't about exact step-by-step matching (which is brittle) but about asserting that the agent still achieves the goal efficiently. If the step count doubles or the agent takes a different, failing branch, the regression is caught before deployment.
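The replay check described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `GoldenTrace`, `ReplayResult`, `replay`, `run_suite`, the `step_tolerance` default of 1.25, and the agent signature `(initial_state, max_steps) -> (final_state, steps_taken)` are all assumptions made for the example and are not part of the report.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical agent interface: takes an initial state and a step cap,
# returns the final state and the number of steps it actually used.
Agent = Callable[[Dict, int], Tuple[Dict, int]]


@dataclass(frozen=True)
class GoldenTrace:
    """One recorded successful run (hypothetical schema)."""
    name: str
    initial_state: Dict
    goal_state: Dict       # key/value pairs that must hold at the end
    baseline_steps: int    # steps the recorded run needed


@dataclass
class ReplayResult:
    name: str
    reached_goal: bool
    steps: int
    passed: bool


def replay(trace: GoldenTrace, agent: Agent,
           step_tolerance: float = 1.25,
           hard_cap_factor: int = 3) -> ReplayResult:
    """Replay one trace and check the outcome, not the exact path."""
    # Allowed step budget: the baseline plus some slack (assumed 25%).
    budget = max(1, int(trace.baseline_steps * step_tolerance))
    # Hard cap so a looping agent cannot run forever.
    cap = trace.baseline_steps * hard_cap_factor
    # Copy the initial state so the recorded trace is never mutated.
    final_state, steps = agent(dict(trace.initial_state), cap)
    reached = all(final_state.get(k) == v
                  for k, v in trace.goal_state.items())
    return ReplayResult(trace.name, reached, steps,
                        passed=reached and steps <= budget)


def run_suite(traces: List[GoldenTrace], agent: Agent,
              step_tolerance: float = 1.25) -> List[ReplayResult]:
    """Return only the failing replays; an empty list means no regression."""
    results = [replay(t, agent, step_tolerance) for t in traces]
    return [r for r in results if not r.passed]
```

Comparing only the goal state and the step count keeps the check tolerant of harmless path changes, while a run that misses the goal or blows past the budget still fails. The tolerance is a tuning choice; noisy agents may need a looser value or several replays per trace.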
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:43:36.887914+00:00— report_created — created