Report #81381
[research] Updating the agent system prompt breaks a previously working tool-calling pattern
Maintain a golden dataset of successful trace spans (system prompt -> tool call -> tool response) and run a diff-based regression test against the new prompt's outputs for the same inputs.
Journey Context:
System prompts are brittle. A seemingly innocuous added instruction can cause the LLM to favor a different tool or to format arguments incorrectly. Because agent behavior is emergent, unit tests of the code alone are not enough; the prompt/LLM interaction itself must be tested as a deterministic unit by replaying recorded traces against the same inputs.
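A minimal sketch of the golden-trace regression check, in Python. Everything named here is hypothetical and not taken from any particular agent framework: `GoldenSpan`, `diff_tool_call`, `run_regression`, the example prompts, and `stub_agent`, which stands in for a real model call and simply simulates a prompt change altering argument formatting. The sketch compares only the tool name and arguments. Recorded tool responses could be diffed the same way. It also assumes the real call is made with sampling settings pinned so that runs are comparable.

```python
from dataclasses import dataclass
from typing import Any, Callable

_MISSING = object()  # sentinel: distinguishes "key absent" from "value is None"

# (tool_name, tool_args) produced by the agent for one input.
ToolCall = tuple[str, dict[str, Any]]


@dataclass
class GoldenSpan:
    """One recorded, known-good step: user input -> expected tool call."""
    user_input: str
    tool_name: str
    tool_args: dict[str, Any]


def diff_tool_call(golden: GoldenSpan, actual: ToolCall) -> list[str]:
    """Return human-readable differences; an empty list means no regression."""
    actual_name, actual_args = actual
    if actual_name != golden.tool_name:
        # A different tool was chosen, so comparing arguments is meaningless.
        return [f"tool: expected {golden.tool_name!r}, got {actual_name!r}"]
    diffs = []
    for key in sorted(set(golden.tool_args) | set(actual_args)):
        expected = golden.tool_args.get(key, _MISSING)
        got = actual_args.get(key, _MISSING)
        if expected is _MISSING:
            diffs.append(f"args[{key!r}]: unexpected value {got!r}")
        elif got is _MISSING:
            diffs.append(f"args[{key!r}]: missing, expected {expected!r}")
        elif expected != got:
            diffs.append(f"args[{key!r}]: expected {expected!r}, got {got!r}")
    return diffs


def run_regression(
    golden_spans: list[GoldenSpan],
    system_prompt: str,
    call_agent: Callable[[str, str], ToolCall],
) -> dict[str, list[str]]:
    """Replay every golden input under system_prompt; collect the failures."""
    failures = {}
    for span in golden_spans:
        diffs = diff_tool_call(span, call_agent(system_prompt, span.user_input))
        if diffs:
            failures[span.user_input] = diffs
    return failures


# --- Illustration only: hypothetical prompts and a stub in place of the LLM ---
OLD_PROMPT = "You are a weather assistant."
NEW_PROMPT = OLD_PROMPT + " Keep every value as short as possible."


def stub_agent(system_prompt: str, user_input: str) -> ToolCall:
    # Simulates the new instruction changing how an argument is formatted.
    unit = "C" if "short as possible" in system_prompt else "celsius"
    return ("get_weather", {"city": "Paris", "unit": unit})


GOLDEN = [GoldenSpan("weather in Paris", "get_weather",
                     {"city": "Paris", "unit": "celsius"})]

if __name__ == "__main__":
    print(run_regression(GOLDEN, OLD_PROMPT, stub_agent))
    print(run_regression(GOLDEN, NEW_PROMPT, stub_agent))
```

In practice `stub_agent` would be replaced by a function that sends the system prompt and the recorded input to the model and extracts the resulting tool call. A non-empty result from `run_regression` would then block the prompt change until each diff is reviewed.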
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:11:58.960577+00:00— report_created — created