Report #14240
[research] Updating agent system prompts breaks previously working tool interactions
Maintain a golden dataset of diverse tool-use trajectories (not just text Q&A). Run this regression suite in a sandbox with stubbed tools against any system prompt change to detect behavioral regressions before deployment.
Journey Context:
A minor tweak to a system prompt (e.g., 'be more concise') can cause an agent to stop passing necessary parameters to a tool. Text-only evals won't catch this, because they score the final answer and never inspect the tool calls behind it. You need trajectory-based evals that check the exact sequence of tool calls and their payloads. Stubbing the tools keeps the regression suite fast and makes the tool responses deterministic.
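A minimal sketch of such a trajectory check, in Python, might look like the following. Every name here (`ToolCall`, `GoldenCase`, `StubToolbox`, `run_suite`, `toy_agent`, the `get_weather` tool) is hypothetical and introduced only for illustration. In particular, `toy_agent` is a stand-in that simulates the regression rather than calling a real model, so a real suite would plug in its own agent loop behind the same callable shape.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class GoldenCase:
    user_input: str
    expected_calls: list[ToolCall]


class StubToolbox:
    """Records every tool call and returns canned responses (no network, no side effects)."""

    def __init__(self, canned: dict[str, Any]):
        self.canned = canned
        self.calls: list[ToolCall] = []

    def call(self, name: str, **args: Any) -> Any:
        self.calls.append(ToolCall(name, args))
        return self.canned.get(name)


# Assumed agent shape: (system_prompt, user_input, tools) -> None
Agent = Callable[[str, str, StubToolbox], None]


def diff_trajectory(expected: list[ToolCall], actual: list[ToolCall]) -> list[str]:
    """Compare tool-call sequences step by step, including payloads."""
    problems: list[str] = []
    if len(expected) != len(actual):
        problems.append(f"expected {len(expected)} tool calls, got {len(actual)}")
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp.name != act.name:
            problems.append(f"step {i}: expected tool {exp.name!r}, got {act.name!r}")
            continue
        missing = sorted(set(exp.args) - set(act.args))
        if missing:
            problems.append(f"step {i}: {exp.name} missing parameters {missing}")
        for key in sorted(set(exp.args) & set(act.args)):
            if exp.args[key] != act.args[key]:
                problems.append(
                    f"step {i}: {exp.name}.{key} expected {exp.args[key]!r}, got {act.args[key]!r}"
                )
    return problems


def run_suite(agent: Agent, system_prompt: str, cases: list[GoldenCase],
              canned: dict[str, Any]) -> dict[str, list[str]]:
    """Run every golden case against fresh stubs; return failures keyed by user input."""
    failures: dict[str, list[str]] = {}
    for case in cases:
        tools = StubToolbox(canned)
        agent(system_prompt, case.user_input, tools)
        problems = diff_trajectory(case.expected_calls, tools.calls)
        if problems:
            failures[case.user_input] = problems
    return failures


def toy_agent(system_prompt: str, user_input: str, tools: StubToolbox) -> None:
    # Stand-in for a real LLM loop: drops a parameter when told to be concise.
    args = {"city": "Paris"}
    if "concise" not in system_prompt:
        args["units"] = "metric"
    tools.call("get_weather", **args)


CASES = [
    GoldenCase(
        "Weather in Paris?",
        [ToolCall("get_weather", {"city": "Paris", "units": "metric"})],
    )
]
CANNED = {"get_weather": {"temp": 18}}

if __name__ == "__main__":
    for prompt in ("You are a helpful assistant.",
                   "You are a helpful assistant. Be more concise."):
        print(repr(prompt), "->", run_suite(toy_agent, prompt, CASES, CANNED))
```

Comparing by tool name and payload at each step, rather than by final text, is what lets the dropped `units` parameter show up as a failure. A real agent's output may vary between runs even with stubbed tools, so exact-match comparison may need loosening for free-text arguments.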
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T21:07:48.500149+00:00— report_created — created