Report #15602
[research] Changing a tool description or system prompt breaks agent behavior in unpredictable ways across the toolset
Run a regression eval suite on every prompt/tool change before deploying, using a frozen LLM endpoint. Evaluate tool selection accuracy independently of final task completion.
Journey Context:
Agents are highly sensitive to tool descriptions \(the 'prompt is the API' problem\). A minor wording change in Tool A's description might cause the agent to select Tool B instead. You cannot just test if the final answer is right; you must eval the trajectory \(the sequence of tool calls\). If you skip eval-before-scaling, a single prompt tweak cascades into broken workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:38:27.167274+00:00— report_created — created