Report #70598
[research] Silent agent degradation goes undetected between releases
Deploy continuous eval canaries: run a fixed, representative task set (20-50 cases) on a schedule (daily or per deploy) against the current production agent. Track per-task success rates over time. Alert on sustained drift from baseline (e.g., 2+ consecutive runs below baseline on the same task), not on a single binary pass/fail result.
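A minimal sketch of the drift rule described above, assuming per-task success rates are stored oldest-first and a baseline rate has been recorded for each task. The function name `drifting_tasks` and the `min_consecutive` and `tolerance` parameters are hypothetical, chosen for illustration only. A consecutive-run rule is a simple heuristic, not a formal significance test.

```python
from typing import Dict, List


def drifting_tasks(
    history: Dict[str, List[float]],
    baseline: Dict[str, float],
    min_consecutive: int = 2,
    tolerance: float = 0.0,
) -> List[str]:
    """Return task ids whose most recent runs all fell below baseline.

    history:  task id -> success rates per canary run, oldest first.
    baseline: task id -> expected success rate (assumed to be recorded
              at a known-good point, e.g. the last release).
    """
    flagged = []
    for task_id, rates in history.items():
        base = baseline.get(task_id)
        # Skip tasks with no baseline or too few runs to judge.
        if base is None or len(rates) < min_consecutive:
            continue
        recent = rates[-min_consecutive:]
        # Flag only when every one of the recent runs is below baseline,
        # so a single noisy run does not trigger an alert.
        if all(rate < base - tolerance for rate in recent):
            flagged.append(task_id)
    return sorted(flagged)
```

A nonzero `tolerance` may help if per-task success rates are noisy because each run uses only a few trials.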
Journey Context:
Silent degradation happens because LLM providers update models without notice, tool APIs change schemas, and prompt drift accumulates across edits. Unlike traditional software, agents don't crash; they get subtly worse. Point-in-time evals at release miss degradation that occurs between releases, and continuous canary evals catch it. The key design choice is to alert on drift relative to a baseline rather than on an absolute threshold, because absolute thresholds are brittle across model versions. LangSmith supports automated eval pipelines for this pattern, but a simple cron job + eval script + Slack alert works for teams without observability platforms.
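The cron job + eval script + Slack alert setup could be sketched roughly as follows. This assumes the agent is callable as a function from prompt to output string, that each canary task carries its own pass/fail check, and that alerts go to a Slack incoming webhook (which accepts a JSON body with a `text` field). All function names, the JSONL history layout, and the message wording are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, List, Tuple

# A task is a prompt plus a check that decides whether the output passes.
Task = Tuple[str, Callable[[str], bool]]


def run_canaries(
    tasks: Dict[str, Task], agent: Callable[[str], str], trials: int = 5
) -> Dict[str, float]:
    """Run each task `trials` times and return per-task success rates."""
    rates = {}
    for task_id, (prompt, check) in tasks.items():
        passes = sum(1 for _ in range(trials) if check(agent(prompt)))
        rates[task_id] = passes / trials
    return rates


def append_run(history_path: str, rates: Dict[str, float]) -> None:
    """Append one timestamped run to a JSONL history file (assumed layout)."""
    record = {"ts": time.time(), "rates": rates}
    with open(history_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def slack_payload(flagged: List[str]) -> bytes:
    """Build the JSON body for a Slack incoming webhook."""
    text = "Canary drift detected on: " + ", ".join(sorted(flagged))
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(webhook_url: str, payload: bytes) -> None:
    """POST the payload to the webhook URL supplied by the caller."""
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

A script built from these pieces could be scheduled with a crontab entry such as `0 6 * * * python run_canaries.py`, where the script name is a placeholder. Running several trials per task gives a success rate rather than a single pass/fail, which is what makes drift tracking possible.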
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:05:05.909879+00:00— report_created — created