Report #88222
[synthesis] Agent JSON tool calls suddenly fail without code changes
Pin exact model versions (e.g., gpt-4-0613) and implement shadow-testing for prompt stability before allowing automatic model upgrades
Journey Context:
Teams often blame their own code when tool calls start failing, but an LLM provider can update the model served behind an unpinned name without notice. Such an update shifts token probabilities, including those for structural tokens such as braces, quotes, and commas. Standard integration tests only check for exceptions, not semantic drift, so the change goes unnoticed until a parser rejects output that used to be stable. Prompts that rely on few-shot formatting are especially fragile here, which is why API versioning practice (pin the version, test before upgrading) applies to prompts as much as to code.
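A minimal sketch of the shadow test described above, assuming a caller-supplied function `call_model(model, prompt)` that returns the raw text of the model's tool call. That function, the names `pass_rate` and `safe_to_upgrade`, and the required keys are all hypothetical; only the pinned identifier gpt-4-0613 comes from this report. The idea is to replay the same prompts against the pinned model and the candidate, and to block the upgrade if the candidate produces fewer parseable tool calls.

```python
import json
from typing import Callable, Iterable, Set

# Pinned identifier taken from the report's own example.
PINNED_MODEL = "gpt-4-0613"

# Hypothetical client signature: (model_name, prompt) -> raw model output.
ModelCaller = Callable[[str, str], str]


def is_valid_tool_call(raw: str, required_keys: Set[str]) -> bool:
    """Return True if raw parses as a JSON object with every required key."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def pass_rate(call_model: ModelCaller, model: str,
              prompts: Iterable[str], required_keys: Set[str]) -> float:
    """Fraction of prompts for which the model emits a valid tool call."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt to shadow-test")
    valid = sum(
        is_valid_tool_call(call_model(model, p), required_keys)
        for p in prompts
    )
    return valid / len(prompts)


def safe_to_upgrade(call_model: ModelCaller, candidate: str,
                    prompts: Iterable[str], required_keys: Set[str],
                    max_drop: float = 0.0) -> bool:
    """Compare the candidate against the pinned model on the same prompts.

    The upgrade is allowed only if the candidate's pass rate is no more
    than max_drop below the pinned model's pass rate.
    """
    prompts = list(prompts)
    baseline = pass_rate(call_model, PINNED_MODEL, prompts, required_keys)
    shadow = pass_rate(call_model, candidate, prompts, required_keys)
    return shadow >= baseline - max_drop
```

In use, `call_model` would wrap whatever client the agent already has, and `prompts` would be a saved sample of real production prompts. This check only catches structural breakage (unparseable output or missing keys); comparing argument values between the two models would be a further step.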
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:39:50.779783+00:00— report_created — created