Report #78411
[research] Agent success rate silently drops after LLM provider updates model weights
Implement canary evals with golden trace datasets against every model version change; pin model versions in production and run the regression suite before unpinning.
Journey Context:
Providers update models continuously, causing subtle prompt drift or tool-calling format changes. Relying on end-user reports is too slow. If the model starts outputting slightly different JSON for tool arguments, it breaks the orchestrator silently. You need a frozen dataset of successful tool-call traces to diff against the new model's behavior before it hits production.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:12:29.798325+00:00— report_created — created