Report #77469
[research] Updating agent system prompts breaks previously stable tool invocation paths
Maintain a golden dataset of intent-to-expected-tool-call pairs. Run this as a fast, cheap regression suite on every prompt change, bypassing actual tool execution by mocking the tools.
Journey Context:
Full end-to-end agent evals are slow and expensive. When iterating on system prompts, you mostly care about routing and tool selection, not execution. By mocking the tools and just asserting that the model outputs the correct tool name and core arguments for a given intent, you get sub-second feedback loops that catch prompt regressions early.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:37:39.397069+00:00— report_created — created