Report #64030
[synthesis] The moving goalpost degradation in AI evaluation
Pin model versions strictly (e.g., gpt-4-0613 instead of gpt-4), shadow-deploy new provider versions alongside old ones, and run eval suites against the shadow before shifting traffic.
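The gate described above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `ModelFn`, `EvalCase`, `pass_rate`, `safe_to_shift`, and the two stub models are hypothetical names introduced here, and the stubs stand in for real calls to a pinned and a shadow model version.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical type: any callable that maps a prompt to the model's text output.
# In practice this would wrap a provider call with an explicitly pinned version.
ModelFn = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True when the output is acceptable


def pass_rate(model: ModelFn, cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases whose output passes its check."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def safe_to_shift(
    pinned: ModelFn,
    shadow: ModelFn,
    cases: Sequence[EvalCase],
    max_regression: float = 0.0,
) -> bool:
    """Allow a traffic shift only if the shadow is no worse than the pinned model."""
    return pass_rate(shadow, cases) >= pass_rate(pinned, cases) - max_regression


# Stub models for illustration only (assumptions, not real provider behavior).
def pinned_stub(prompt: str) -> str:
    return prompt.upper()


def shadow_stub(prompt: str) -> str:
    # Simulates a silent behavior change on longer prompts.
    return prompt.upper() if len(prompt) < 4 else prompt


cases = [
    EvalCase("abc", lambda out: out == "ABC"),
    EvalCase("hello", lambda out: out == "HELLO"),
]

print(safe_to_shift(pinned_stub, shadow_stub, cases))  # False
```

Here the shadow passes only one of the two cases, so the gate blocks the shift. The `max_regression` tolerance is a design choice assumed for this sketch; a stricter setup might also require that no individual case regresses.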
Journey Context:
In conventional software, a passing test suite stays passing unless your code changes. With a hosted AI model, the provider can silently update the underlying model, causing previously passing evals to fail. Your product's correctness is coupled to a dependency you do not control and cannot truly pin forever, so evals must be re-run continuously, even during code freezes.
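One way to make this drift visible is to re-run the eval suite on a schedule, independent of code changes, and compare the results against a stored baseline. The sketch below assumes eval results are kept as a mapping from case ID to pass/fail; `find_regressions` and the case IDs are hypothetical names used only for illustration.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Return IDs of cases that passed in the baseline but fail or are missing now."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )


# Illustrative data only: results recorded at release time vs. a later scheduled run
# with no code changes in between.
baseline = {"refund-policy": True, "json-format": True, "date-parse": False}
current = {"refund-policy": True, "json-format": False, "date-parse": False}

print(find_regressions(baseline, current))  # ['json-format']
```

A non-empty result during a code freeze points at the model dependency rather than at your own code, which is the situation described above.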
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:57:37.147625+00:00— report_created — created