Report #12256
[research] Updating the agent's system prompt fixes one edge case but silently breaks core workflows
Maintain a golden dataset of diverse, representative agent trajectories and run automated regression evals on every PR that modifies the system prompt or tool definitions.
Journey Context:
Agent behavior is highly sensitive to system prompt wording. A tweak that fixes a rare failure mode can shift the agent's priority ordering and cause it to skip a crucial step in a common workflow. Because agent runs are non-deterministic, unit tests of the surrounding logic are not enough on their own. Run the agent against the golden dataset and compare both the trajectory and the final output with the baseline, so regressions are caught before merging.
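The comparison described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names GoldenCase, Trajectory, find_regressions, the tool names, and the stub agents are all hypothetical, and it assumes the agent can be called as a function that returns the ordered tool calls plus the final text. Each case is run several times and scored as a pass rate, since a single run of a non-deterministic agent says little.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Trajectory:
    """What one agent run produced (hypothetical shape)."""
    tool_calls: Tuple[str, ...]
    final_output: str


@dataclass(frozen=True)
class GoldenCase:
    """One entry of the golden dataset (hypothetical shape)."""
    case_id: str
    user_input: str
    required_steps: Tuple[str, ...]  # must appear in this order
    expected_output: str             # substring expected in the final answer


Agent = Callable[[str], Trajectory]


def is_subsequence(required: Sequence[str], actual: Sequence[str]) -> bool:
    """True if every required step occurs in `actual`, in order."""
    remaining = iter(actual)
    return all(step in remaining for step in required)


def pass_rate(agent: Agent, case: GoldenCase, trials: int) -> float:
    """Fraction of runs that keep the required steps and expected output."""
    passed = 0
    for _ in range(trials):
        run = agent(case.user_input)
        steps_ok = is_subsequence(case.required_steps, run.tool_calls)
        output_ok = case.expected_output in run.final_output
        passed += steps_ok and output_ok
    return passed / trials


def find_regressions(
    agent: Agent,
    cases: Sequence[GoldenCase],
    baseline_rates: Dict[str, float],
    trials: int = 5,
    tolerance: float = 0.0,
) -> Dict[str, float]:
    """Return {case_id: new_rate} for cases that fall below the baseline."""
    regressions = {}
    for case in cases:
        rate = pass_rate(agent, case, trials)
        if rate < baseline_rates[case.case_id] - tolerance:
            regressions[case.case_id] = rate
    return regressions


# Stub agents standing in for the real one (assumptions for illustration).
def baseline_agent(user_input: str) -> Trajectory:
    return Trajectory(("lookup_order", "verify_identity", "issue_refund"),
                      "Refund issued.")


def candidate_agent(user_input: str) -> Trajectory:
    # Simulates a prompt change that makes the agent skip verification.
    return Trajectory(("lookup_order", "issue_refund"), "Refund issued.")


GOLDEN = [
    GoldenCase("refund", "Please refund order 1001",
               ("verify_identity", "issue_refund"), "Refund issued"),
]
BASELINE_RATES = {"refund": 1.0}
```

In CI, a job triggered by changes to the system prompt or tool definitions could call find_regressions(candidate_agent, GOLDEN, BASELINE_RATES) and fail the build when the result is non-empty. The substring and ordered-step checks are deliberately crude; a real setup would likely need a looser output comparison and a non-zero tolerance to absorb run-to-run noise.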
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:36:54.193576+00:00— report_created — created