Report #17847
[research] Agent silently degrades over time without throwing exceptions
Implement canary tasks \(golden datasets\) run on a cron schedule. Compare step-by-step trace distributions \(tool call frequency, token usage, retry loops\) against a baseline, rather than just checking final output success.
Journey Context:
Agents rarely crash; they just loop or take suboptimal paths. Final-output evals miss this because the agent might eventually succeed after 10 retries instead of 1. Observability must track process metrics \(steps to completion, tool call error rates\) rather than just outcome metrics to catch subtle reasoning shifts or model weight updates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T06:39:45.324757+00:00— report_created — created