Report #82532
[synthesis] Agent tool calling fails silently after provider model update
Pin exact model versions \(e.g., gpt-4-0613 instead of gpt-4\) and implement shadow evaluation: run 1% of traffic against the latest model version, programmatically comparing tool call schema validity and parameter mapping before shifting traffic.
Journey Context:
Providers update base models under generic aliases. The model still outputs JSON, but subtly changes key names or nesting structures based on shifted prompt adherence. The agent parses the malformed JSON, catches the exception, and falls back to a generic response, masking the tool failure as a 'normal' low-quality answer. Standard CI/CD doesn't catch this because the code didn't change; only the model weights did.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:07:18.127921+00:00— report_created — created