Report #74163
[research] Underlying LLM API updates silently break agent tool-calling logic without throwing errors
Implement shadow regression evals that run daily against the production LLM endpoint. Measure tool-selection accuracy and argument-extraction exact-match on a frozen golden dataset, and alert when either metric shifts by more than 5%.
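A minimal sketch of the scoring and alerting step, assuming Python. The names `GoldenCase`, `ToolCall`, `score`, and `regressions` are hypothetical, as is the `call_model` callable, which stands in for whatever client hits the live endpoint. The 5% threshold is read here as an absolute drop of 0.05 against a stored baseline run; that reading is an assumption, not something the report specifies.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GoldenCase:
    """One frozen example: a prompt plus the tool call we expect for it."""
    prompt: str
    expected_tool: str
    expected_args: dict


@dataclass(frozen=True)
class ToolCall:
    """What the model returned: tool name and the raw JSON argument string."""
    tool: str
    raw_args: str


def _parse_args(raw: str) -> Optional[dict]:
    """Return parsed arguments, or None if the model emitted malformed JSON."""
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None


def score(cases: list[GoldenCase], call_model: Callable[[str], ToolCall]) -> dict:
    """Compute tool-selection accuracy and argument exact-match over the golden set."""
    if not cases:
        raise ValueError("golden dataset is empty")
    tool_hits = 0
    arg_hits = 0
    for case in cases:
        call = call_model(case.prompt)  # hypothetical live-endpoint call
        if call.tool == case.expected_tool:
            tool_hits += 1
            # Arguments only count when the right tool was selected.
            if _parse_args(call.raw_args) == case.expected_args:
                arg_hits += 1
    n = len(cases)
    return {"tool_accuracy": tool_hits / n, "arg_exact_match": arg_hits / n}


def regressions(current: dict, baseline: dict, threshold: float = 0.05) -> list[str]:
    """Names of metrics that dropped by more than `threshold` (assumed absolute)."""
    return [name for name, base in baseline.items()
            if base - current[name] > threshold]
```

A daily job could call `score` with the real client, pass the result and the stored baseline to `regressions`, and page someone if the returned list is non-empty. Malformed JSON is counted as an argument miss rather than raised, so drift shows up in the metric instead of aborting the run.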
Journey Context:
LLM providers update their models, which can change how the model formats JSON tool calls or how closely it follows the system prompt. The agent doesn't crash; it just passes malformed JSON to tools, causing silent logic errors. Unit tests mock the LLM, so they keep passing. You need live-endpoint evals that catch drift in the LLM's adherence to your tool schemas before it reaches users in production.
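To make "adherence to your tool schemas" concrete, here is a rough sketch of a per-call check a live eval could run on each model response. The function name `schema_violations` and the simplified schema format (argument name mapped to a Python type) are assumptions for illustration; a real setup would more likely validate against the full JSON Schema registered for each tool.

```python
import json


def schema_violations(raw_args: str, required: dict[str, type]) -> list[str]:
    """List the ways a raw tool-call argument string departs from the expected shape."""
    try:
        args = json.loads(raw_args)
    except (json.JSONDecodeError, TypeError):
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments are not a JSON object"]

    problems = []
    # Every expected key must be present with the expected type.
    for key, expected_type in required.items():
        if key not in args:
            problems.append(f"missing key: {key}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"wrong type for {key}: {type(args[key]).__name__}")
    # Keys the schema does not know about often signal a renamed field.
    for key in args:
        if key not in required:
            problems.append(f"unexpected key: {key}")
    return problems
```

Logging these violations per golden case gives the eval a reason for each failure (renamed field, stringified number, truncated JSON), which is easier to act on than a bare accuracy drop.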
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:04:42.785880+00:00— report_created — created