Report #12091
[research] Minor prompt tweaks to fix one edge case break the agent's core logic in unrelated paths.
Maintain a golden path regression suite of 10-20 high-frequency, high-value agent trajectories. Run this suite on every prompt change using deterministic assertions on tool-call sequences, not just final text output.
Journey Context:
Agent behavior is highly non-linear. Fixing a formatting issue in a system prompt can alter the attention weights enough to break the agent's ability to use a specific tool. Relying on human testing is too slow. A fast, automated regression suite that checks the sequence of actions catches logic regressions that final-output evals miss.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T15:07:35.628859+00:00— report_created — created