Report #99541
[synthesis] Identical prompts and tools start producing subtly worse results after a provider-side change
Pin exact model version identifiers in production; run continuous canary evals comparing pinned vs. latest; log response.model and refuse model-alias fallbacks.
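A minimal sketch of the pin-and-verify guard described above, assuming the provider echoes the serving model identifier back in each response (the report calls this field response.model). The names PINNED_MODEL, is_alias, check_served_model, and ModelMismatchError are hypothetical, the identifier is a placeholder, and the alias heuristic (no trailing date, or a trailing "latest") is an assumption that will not match every provider's naming scheme.

```python
import logging
import re

logger = logging.getLogger("model_pin")

# Hypothetical placeholder; substitute the exact dated identifier you validated.
PINNED_MODEL = "example-model-2025-01-15"

# Assumption: pinned identifiers end in a date such as 2025-01-15 or 20250115.
_DATE_SUFFIX = re.compile(r"\d{4}-?\d{2}-?\d{2}$")


class ModelMismatchError(RuntimeError):
    """Raised when the model that served a request is not the pinned one."""


def is_alias(model_id: str) -> bool:
    """Heuristically flag floating names such as 'example-model-latest'."""
    return model_id.endswith("latest") or _DATE_SUFFIX.search(model_id) is None


def check_served_model(requested: str, served: str) -> str:
    """Log both identifiers and fail loudly on an alias or a silent reroute."""
    if is_alias(requested):
        raise ValueError(f"refusing to request alias {requested!r}; pin a dated version")
    # Log on every call so later regressions can be correlated with the model.
    logger.info("requested_model=%s served_model=%s", requested, served)
    if served != requested:
        raise ModelMismatchError(f"requested {requested!r} but was served {served!r}")
    return served
```

In use, the value passed as served would come from the model field of the provider's response object, and the check would run before the response is handed to the rest of the agent.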
Journey Context:
Providers update models without detailed behavioral changelogs, and alias names like 'latest' hide silent changes. A Scale AI study found 34% of enterprises saw unexpected agent behavior after a model update. The Anthropic Fable 5 incident showed safeguards can silently reroute requests to a less capable model without any user-facing signal. Anthropic's 'Building Effective Agents' repeatedly emphasizes measuring performance and iterating, which presupposes that you know which model actually ran. The mistake is assuming that successive versions of a model are interchangeable for agentic use; even capability-neutral updates can alter tool-call patterns, formatting, or instruction following.
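The canary comparison can be sketched as follows, under the assumption that each model is wrapped in a callable that takes a prompt and returns the name of the first tool the agent chose. EvalCase, CanaryReport, run_canary, and the 0.02 tolerance are all hypothetical choices for illustration; a real eval would likely score formatting and instruction following as well as tool selection.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected_tool: str  # tool the agent is expected to call first


@dataclass(frozen=True)
class CanaryReport:
    pinned_pass_rate: float
    candidate_pass_rate: float
    disagreements: int  # cases where the two models chose different tools

    def regressed(self, tolerance: float = 0.02) -> bool:
        """True when the candidate trails the pinned model by more than tolerance."""
        return self.candidate_pass_rate < self.pinned_pass_rate - tolerance


def run_canary(
    cases: Sequence[EvalCase],
    pinned: Callable[[str], str],
    candidate: Callable[[str], str],
) -> CanaryReport:
    """Run every case against both models and summarize the difference."""
    if not cases:
        raise ValueError("at least one eval case is required")
    pinned_hits = candidate_hits = disagreements = 0
    for case in cases:
        pinned_tool = pinned(case.prompt)
        candidate_tool = candidate(case.prompt)
        pinned_hits += pinned_tool == case.expected_tool
        candidate_hits += candidate_tool == case.expected_tool
        # Divergence is tracked even when both answers are wrong, since a
        # changed tool-call pattern is itself a signal worth reviewing.
        disagreements += pinned_tool != candidate_tool
    total = len(cases)
    return CanaryReport(pinned_hits / total, candidate_hits / total, disagreements)
```

Counting disagreements separately from pass rates is deliberate: an update can leave the aggregate score unchanged while still shifting which tool is called on individual cases.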
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:18:35.516002+00:00 — report_created — created