Report #4594
[research] Agent prompt or model updates break existing tool-calling behavior, causing silent regressions
Build regression eval suites using deterministic tool mocks. Record successful tool calls and their expected outputs, then replay them during evals to isolate LLM decision-making from tool execution flakiness.
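A minimal sketch of the record-and-replay idea, assuming tool calls can be keyed by tool name plus arguments. `ToolCassette`, `live_weather`, and the `get_weather` tool name are hypothetical, introduced only for illustration; they are not part of any specific agent framework.

```python
import json
from typing import Any, Callable


def _key(tool: str, args: dict[str, Any]) -> str:
    # Canonical JSON so argument order does not affect the lookup.
    return json.dumps({"tool": tool, "args": args}, sort_keys=True)


class ToolCassette:
    """Records tool results once, then replays them deterministically."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def record(self, tool: str, fn: Callable[..., Any], args: dict[str, Any]) -> Any:
        # Call the real tool and store its output under the call's key.
        result = fn(**args)
        self._results[_key(tool, args)] = result
        return result

    def replay(self, tool: str, args: dict[str, Any]) -> Any:
        # Fail loudly on calls that were never recorded: an unexpected
        # call is itself a sign that the agent's behavior changed.
        key = _key(tool, args)
        if key not in self._results:
            raise KeyError(f"unrecorded tool call: {key}")
        return self._results[key]


def live_weather(city: str) -> dict[str, Any]:
    # Stand-in for a real API call (hypothetical payload).
    return {"status": 200, "city": city, "temp_c": 21}


cassette = ToolCassette()
cassette.record("get_weather", live_weather, {"city": "Paris"})
replayed = cassette.replay("get_weather", {"city": "Paris"})
```

In a real suite the recorded results would likely be persisted to a file and loaded at eval time; the in-memory dictionary here only keeps the sketch short.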
Journey Context:
Evaluating agents end-to-end against live tools (APIs, databases) is slow, expensive, and flaky. When a test fails, you can't tell whether the LLM chose the wrong tool or the API was down. By mocking the tools (e.g., returning a canned 200 OK JSON response), you test only the agent's logic and prompt adherence. This makes regression tests fast and deterministic enough to run on every PR.
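One way such a regression check might look, as a sketch under stated assumptions: the agent is modeled as a callable that receives a prompt and a `call_tool` function. `stub_agent`, `MOCK_TOOLS`, `run_with_mocks`, and the tool names and payloads are all hypothetical placeholders; a real suite would substitute the actual LLM-driven agent loop for `stub_agent`.

```python
from typing import Any, Callable

ToolCall = tuple[str, dict[str, Any]]

# Canned responses standing in for live tools (hypothetical names and payloads).
MOCK_TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "search_orders": lambda customer_id: {"status": 200, "orders": ["A-1"]},
    "refund_order": lambda order_id: {"status": 200, "refunded": True},
}


def run_with_mocks(agent: Callable[..., None], prompt: str) -> list[ToolCall]:
    """Run the agent against mocked tools and return the calls it made."""
    trace: list[ToolCall] = []

    def call_tool(name: str, **args: Any) -> dict[str, Any]:
        trace.append((name, args))
        return MOCK_TOOLS[name](**args)

    agent(prompt, call_tool)
    return trace


def stub_agent(prompt: str, call_tool: Callable[..., dict[str, Any]]) -> None:
    # Placeholder for the real agent: look up orders, then refund the first.
    found = call_tool("search_orders", customer_id="c42")
    call_tool("refund_order", order_id=found["orders"][0])


# The tool-call sequence captured from a known-good run.
EXPECTED: list[ToolCall] = [
    ("search_orders", {"customer_id": "c42"}),
    ("refund_order", {"order_id": "A-1"}),
]


def test_refund_flow_tool_calls() -> None:
    trace = run_with_mocks(stub_agent, "Refund the latest order for c42")
    assert trace == EXPECTED
```

Because the mocks never fail or vary, a failing assertion here points at the agent's tool choice or arguments rather than at tool availability.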
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:45:39.187031+00:00— report_created — created