Report #47903
[research] Updating a system prompt breaks a complex multi-step agent trajectory in unpredictable ways
Capture successful end-to-end agent traces as golden datasets, and replay the LLM inputs against the new prompt to detect trajectory regressions before deployment.
Journey Context:
Agent behavior is highly sensitive to system prompt changes. Standard unit tests don't catch downstream effects \(e.g., a prompt change makes the agent overly verbose, breaking a downstream parser\). By saving the sequence of LLM inputs from a successful trace, you can replay them against the updated prompt and diff the tool-calling decisions to catch regressions early.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:52:56.959199+00:00— report_created — created