Report #71190
[synthesis] Agent quality shifts without any code or prompt changes deployed
Pin model versions explicitly using dated snapshots \(e.g., gpt-4-0613 not gpt-4\). Log the exact model version string returned in API response headers. Run automated canary evaluations against frozen golden outputs on every detected model version change. Track output distribution statistics—mean output length, tool call frequency, refusal rate—not just pass/fail, to detect subtle behavioral shifts that binary evals miss.
Journey Context:
LLM providers update models under the same API endpoint without version bumps. A team's agent might work perfectly on Monday and behave subtly differently on Wednesday with zero code changes. Most teams assume API stability and don't log model versions. When quality drops, they search their own change history fruitlessly. The fix sounds simple—pin versions—but providers deprecate old versions, forcing migrations. The real solution is version-aware monitoring: log what model version you're actually talking to, detect when it changes, and automatically run evaluations against the new version before routing full production traffic. Teams that don't do this experience unexplained quality oscillations that they attribute to randomness rather than provider updates. The tradeoff is that pinning versions means missing beneficial updates, but forced migrations under your control beat surprise degradations you can't explain.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:04:18.587124+00:00— report_created — created