Report #24399
[research] Agent success rate slowly degrades over weeks without triggering any failed assertions
Instrument agent runs with OpenTelemetry spans around each tool call and LLM interaction. Treat step counts, token usage, and tool selection frequency as continuous signals, and alert on statistical drift in their distributions rather than on binary pass/fail results.
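One way to turn tool selection frequency into a drift signal is to compare a baseline window of tool calls against a recent window using Jensen-Shannon divergence. This is a minimal sketch, not a verified fix: it assumes the tool names have already been extracted from span data into plain lists, and the names `js_divergence`, `tool_drift_alert`, and the `DRIFT_THRESHOLD` value are hypothetical and would need tuning against real history.

```python
import math
from collections import Counter

# Hypothetical threshold; calibrate against your own historical windows.
DRIFT_THRESHOLD = 0.1


def js_divergence(baseline_calls, recent_calls):
    """Jensen-Shannon divergence (log base 2) between two lists of tool names.

    Returns 0.0 when both windows have the same tool mix and 1.0 when
    they share no tools at all.
    """
    if not baseline_calls or not recent_calls:
        raise ValueError("both windows need at least one tool call")
    base, recent = Counter(baseline_calls), Counter(recent_calls)
    n_base, n_recent = len(baseline_calls), len(recent_calls)
    divergence = 0.0
    for tool in set(base) | set(recent):
        p = base[tool] / n_base      # share of this tool in the baseline
        q = recent[tool] / n_recent  # share of this tool in the recent window
        m = (p + q) / 2              # mixture distribution
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence


def tool_drift_alert(baseline_calls, recent_calls, threshold=DRIFT_THRESHOLD):
    """True when the recent tool mix has moved past the threshold."""
    return js_divergence(baseline_calls, recent_calls) > threshold


# Illustrative data: a third tool starts taking a share of the calls.
baseline = ["search"] * 80 + ["fetch"] * 20
recent = ["search"] * 50 + ["fetch"] * 20 + ["legacy_lookup"] * 30
drifted = tool_drift_alert(baseline, recent)
```

Jensen-Shannon divergence is used here because it stays finite when a tool appears in only one window, which is exactly the case of an agent starting to call a tool it never used before.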
Journey Context:
Binary pass/fail tests don't catch silent degradation. An agent might still reach the correct answer but take 15 steps instead of 3, or call a deprecated tool before recovering. Either pattern can indicate a degrading prompt or shifting model weights. OTel spans let you monitor these process metrics, which tend to degrade well before the outcome metrics start failing.
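The step-count case above (15 steps instead of 3) can be sketched as a simple mean-shift check: measure how many standard errors the recent mean step count sits above a baseline mean. This is an illustrative sketch under stated assumptions. The step counts are assumed to be already aggregated per run from span data, the function names and the threshold of 3 are hypothetical, and a z-score is only one of several reasonable drift tests.

```python
import math
import statistics

# Hypothetical alert threshold, in standard errors.
Z_THRESHOLD = 3.0


def step_count_z_score(baseline_steps, recent_steps):
    """How many standard errors the recent mean step count is from the baseline mean.

    Uses the baseline's sample standard deviation and the size of the
    recent window to estimate the standard error of the recent mean.
    """
    if len(baseline_steps) < 2 or not recent_steps:
        raise ValueError("need at least 2 baseline runs and 1 recent run")
    base_mean = statistics.fmean(baseline_steps)
    base_std = statistics.stdev(baseline_steps)
    shift = statistics.fmean(recent_steps) - base_mean
    if base_std == 0:
        # Baseline never varied: any change at all is an unbounded shift.
        return 0.0 if shift == 0 else math.copysign(math.inf, shift)
    standard_error = base_std / math.sqrt(len(recent_steps))
    return shift / standard_error


def step_count_alert(baseline_steps, recent_steps, threshold=Z_THRESHOLD):
    """True when recent runs take notably more steps than the baseline."""
    return step_count_z_score(baseline_steps, recent_steps) > threshold


# Illustrative data: every run still succeeds, but takes more steps.
baseline_steps = [3, 4, 3, 5, 4, 3, 4, 5, 3, 4]
recent_steps = [6, 7, 5, 8, 6]
alert = step_count_alert(baseline_steps, recent_steps)
```

The check is one-sided on purpose: fewer steps than the baseline is not treated as degradation here, though a sudden drop could also be worth a separate look.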
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:21:39.066622+00:00— report_created — created