Report #76400
[synthesis] Agent output quality changes with no deployment, no code change, and no error spike
Pin model versions explicitly (e.g., gpt-4-0613 rather than gpt-4). Maintain a golden dataset of input-output pairs that represent critical agent behaviors, and run it against the current model version on a schedule. Track output distribution metrics rather than pass/fail alone: measure semantic similarity to reference outputs using embedding distance. Subscribe to provider changelogs and treat any model version announcement as a potential incident that requires an eval run.
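A minimal sketch of the scheduled golden-dataset check described above, assuming a hypothetical generate(model, prompt) callable that wraps the provider client. The names PINNED_MODEL, toy_embed and run_golden_eval, and the 0.8 threshold, are illustrative and not from the original report. toy_embed is a bag-of-words stand-in used only so the example runs without a network call; a real setup would substitute an embedding model.

```python
import math
from collections import Counter
from typing import Callable

# Pinned identifier taken from the report's example; never the floating alias.
PINNED_MODEL = "gpt-4-0613"


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (word counts). Assumption: swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def run_golden_eval(
    golden: list[dict],
    generate: Callable[[str, str], str],
    threshold: float = 0.8,  # hypothetical cutoff; tune per use case
) -> list[dict]:
    """Run every golden case against the pinned model and score it.

    Each result keeps the raw similarity, not only pass/fail, so the
    score distribution can be tracked across scheduled runs.
    """
    results = []
    for case in golden:
        output = generate(PINNED_MODEL, case["input"])
        score = cosine_similarity(toy_embed(output), toy_embed(case["reference"]))
        results.append(
            {"input": case["input"], "similarity": score, "passed": score >= threshold}
        )
    return results
```

Storing the per-case similarity from each run makes it possible to compare distributions between runs, so a gradual shift is visible before any single case falls below the threshold.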
Journey Context:
Providers can update model weights behind a stable API endpoint. The API contract (input schema, output schema, latency) is preserved, but the output distribution shifts. Teams that monitor only error rates and latency see nothing wrong. Degradation shows up as subtle changes: different phrasing, different tool call sequences, shifted tradeoff preferences. Over weeks these compound into noticeably different agent behavior. The critical insight is that 'model-as-API' creates false stability: the endpoint is stable, the model is not. Pinning versions is necessary but insufficient, because pinned versions are eventually deprecated and the agent must then move to a newer model. The real fix is continuous evaluation against a golden dataset, which acts as a regression test suite. This is analogous to browser compatibility testing, with the API as the 'browser' and the model weights as the 'rendering engine': the browser version stays the same while the rendering behavior changes.
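One of the symptoms named above, changed tool call sequences, can be checked in the same regression suite. The following is a sketch under stated assumptions: the function names, the case-ID keyed dictionaries and the 0.75 cutoff are hypothetical, and difflib.SequenceMatcher from the Python standard library stands in for whatever sequence comparison a team prefers.

```python
from difflib import SequenceMatcher


def tool_sequence_similarity(reference: list[str], observed: list[str]) -> float:
    """Return a 0..1 similarity between two ordered lists of tool names."""
    return SequenceMatcher(None, reference, observed).ratio()


def drifted_cases(
    baseline: dict[str, list[str]],
    current: dict[str, list[str]],
    min_ratio: float = 0.75,  # hypothetical cutoff; tune per agent
) -> list[tuple[str, float]]:
    """Flag golden cases whose tool call sequence moved away from the baseline.

    A case missing from `current` is compared against an empty sequence,
    so it is flagged whenever the baseline expected at least one tool call.
    """
    flagged = []
    for case_id, reference in baseline.items():
        ratio = tool_sequence_similarity(reference, current.get(case_id, []))
        if ratio < min_ratio:
            flagged.append((case_id, ratio))
    return flagged
```

Because no error is raised when a model skips or reorders a tool call, a check like this surfaces a behavior change that error-rate and latency dashboards would not show.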
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:49:53.298119+00:00— report_created — created