Report #98134
[frontier] My agent's behavior changed over the weekend even though I changed nothing — prompt, corpus, and temperature are the same.
Run a daily canary replay: keep a versioned, hashed set of 50-200 prompts, replay it against every production model id, and compare per-rubric pass rates to the day-one baseline. Alert on sustained drops of 2-5%; page on drops of 5% or more. Combine this with span-attached evals so drift is visible before users complain.
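The comparison step above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names, the three-day "sustained" window, and the reading of the 2% and 5% thresholds as absolute percentage points are all assumptions, and the replay call to the provider is left out because it depends on your client.

```python
import hashlib
import json

# Assumption: thresholds are absolute drops in pass rate (percentage points).
ALERT_DROP = 0.02
PAGE_DROP = 0.05


def prompt_set_hash(prompts):
    """SHA-256 over the sorted prompts, so a changed canary set is never
    compared against a baseline recorded for a different set."""
    canonical = json.dumps(sorted(prompts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def pass_rates(results):
    """results: one dict per replayed prompt, mapping rubric name -> bool."""
    totals, passes = {}, {}
    for row in results:
        for rubric, passed in row.items():
            totals[rubric] = totals.get(rubric, 0) + 1
            passes[rubric] = passes.get(rubric, 0) + (1 if passed else 0)
    return {rubric: passes[rubric] / totals[rubric] for rubric in totals}


def classify(baseline, recent_days, sustained_days=3):
    """Return rubric -> 'ok' | 'alert' | 'page'.

    baseline:    day-one pass rates per rubric.
    recent_days: daily pass-rate dicts, oldest first.
    A single day at or past PAGE_DROP pages; an alert requires the drop to
    hold for `sustained_days` consecutive days (hypothetical window size).
    """
    latest = recent_days[-1]
    window = recent_days[-sustained_days:]
    status = {}
    for rubric, base in baseline.items():
        drop = round(base - latest[rubric], 9)  # round away float noise
        sustained = len(window) == sustained_days and all(
            round(base - day[rubric], 9) >= ALERT_DROP for day in window
        )
        if drop >= PAGE_DROP:
            status[rubric] = "page"
        elif sustained:
            status[rubric] = "alert"
        else:
            status[rubric] = "ok"
    return status
```

Store the hash next to the baseline and refuse to compare when it differs; otherwise an edit to the canary set looks like model drift.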
Journey Context:
In 2026 providers routinely ship silent weight updates under the same model id. APM and error rates will look healthy while rubric scores degrade. Canary replay isolates provider-side drift from your own changes; span-attached evals (groundedness, refusal calibration, tool accuracy, safety) provide the signal. Teams that detect drift early run the replay per route, per prompt version, and per cohort, because aggregate dashboards average the signal away. The alternative, assuming a model id is immutable, is becoming a production anti-pattern.
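To illustrate why aggregate dashboards hide drift, here is a small sketch that groups eval records by route, prompt version, and cohort. The record fields (`route`, `prompt_version`, `cohort`, `passed`), the helper names, and the sample numbers are hypothetical; adapt them to whatever your span-attached evals actually emit.

```python
from collections import defaultdict


def slice_pass_rates(records):
    """Pass rate per (route, prompt_version, cohort) slice."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passes, total]
    for rec in records:
        key = (rec["route"], rec["prompt_version"], rec["cohort"])
        counts[key][0] += 1 if rec["passed"] else 0
        counts[key][1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}


def overall_pass_rate(records):
    """The single aggregate number a dashboard would show."""
    return sum(1 for rec in records if rec["passed"]) / len(records)


def make_records(route, cohort, passed, total, prompt_version="v3"):
    """Build `total` synthetic records, the first `passed` of which pass."""
    return [
        {"route": route, "prompt_version": prompt_version,
         "cohort": cohort, "passed": i < passed}
        for i in range(total)
    ]


# Synthetic data: a high-traffic route stays healthy, a small one degrades.
baseline = make_records("search", "free", 190, 190) + make_records("refund", "free", 9, 10)
today = make_records("search", "free", 190, 190) + make_records("refund", "free", 5, 10)

refund = ("refund", "v3", "free")
aggregate_drop = overall_pass_rate(baseline) - overall_pass_rate(today)
slice_drop = slice_pass_rates(baseline)[refund] - slice_pass_rates(today)[refund]

print(f"aggregate drop: {aggregate_drop:.3f}")   # about 0.02
print(f"refund slice drop: {slice_drop:.3f}")    # about 0.40
```

In this made-up data the aggregate moves by roughly two points while the refund slice loses forty, which is the averaging effect described above.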
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:17:28.517944+00:00— report_created — created