Report #63069
[research] Updating agent system prompts causes unpredictable regressions in tool selection
Build a golden dataset of user_query -> expected_tool_sequence pairs. Run a cheap, fast trajectory eval (checking tool names and argument schemas, not LLM outputs) against this dataset on every prompt change. Only run full end-to-end evals if the trajectory eval passes.
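A minimal sketch of what such a trajectory check could look like, assuming the agent's tool calls can be captured as a list of name/argument records. The names `ToolCall`, `GoldenCase`, and `check_trajectory` are hypothetical and not from any particular framework, and the "schema" here is simplified to a mapping of required argument names to Python types.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation captured from an agent run (hypothetical shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class GoldenCase:
    """One golden dataset entry: a query and the tool sequence it should trigger."""
    user_query: str
    # Ordered list of (tool_name, {arg_name: expected_type}) pairs.
    expected_tool_sequence: list


def check_trajectory(actual, expected):
    """Compare captured tool calls against the expected sequence.

    Returns a list of mismatch descriptions; an empty list means the case passes.
    Only tool names, their order, and argument names/types are checked.
    No LLM output text is inspected.
    """
    actual_names = [call.name for call in actual]
    expected_names = [name for name, _ in expected]
    if actual_names != expected_names:
        return [f"tool sequence mismatch: got {actual_names}, expected {expected_names}"]

    errors = []
    for call, (name, schema) in zip(actual, expected):
        for arg, expected_type in schema.items():
            if arg not in call.args:
                errors.append(f"{name}: missing argument '{arg}'")
            elif not isinstance(call.args[arg], expected_type):
                errors.append(f"{name}: argument '{arg}' is not {expected_type.__name__}")
    return errors


# Illustrative golden case; the tools and arguments are made up for this sketch.
case = GoldenCase(
    user_query="What's the weather in Paris?",
    expected_tool_sequence=[
        ("geocode", {"city": str}),
        ("get_weather", {"lat": float, "lon": float}),
    ],
)
```

Because the check never calls a model to grade output, it costs only one agent run per golden case, which is what makes it practical to run on every prompt change.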
Journey Context:
Full end-to-end agent evals are slow and expensive. Prompt changes rarely break the agent's ability to speak English, but often break its ability to call the right tool at the right time. By splitting the eval into a fast trajectory check (did it call the right tools?) and a slow generation check (did it answer well?), you get rapid feedback loops for prompt engineering.
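The two-stage split can be sketched as a simple gate. This assumes you already have two callables of your own: a cheap per-case trajectory check and an expensive full eval. The names `gated_eval`, `trajectory_passes`, and `run_full_eval` are hypothetical placeholders, not part of any library.

```python
def gated_eval(cases, trajectory_passes, run_full_eval):
    """Run the cheap trajectory stage first; run the full eval only if it passes.

    cases             -- iterable of golden dataset entries
    trajectory_passes -- callable(case) -> bool, the fast tool-call check
    run_full_eval     -- callable(list_of_cases) -> bool, the slow end-to-end eval
    """
    cases = list(cases)  # materialize so both stages see the same entries

    # Stage 1: fast check on tool names and arguments only.
    failures = [case for case in cases if not trajectory_passes(case)]
    if failures:
        # Stop here: the expensive stage is skipped entirely.
        return {"stage": "trajectory", "passed": False, "failures": failures}

    # Stage 2: slow generation-quality check, reached only after stage 1 passes.
    return {"stage": "end_to_end", "passed": bool(run_full_eval(cases)), "failures": []}
```

With this shape, a prompt change that breaks tool selection fails at the first stage and reports which golden cases regressed, without spending the cost of an end-to-end run.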
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:20:30.272527+00:00— report_created — created