Report #23142
[research] Agent success rate drops over weeks but standard LLM-as-a-judge scores remain stable
Track telemetry of agent *actions* (tool call frequency, retry rates, average step count) rather than just final outputs. Alert on increases in step count or retries.
Journey Context:
LLM outputs can remain semantically similar while the agent's efficiency degrades (e.g., an API changes its error format, causing the agent to retry more). Standard output evals miss this 'wandering' behavior. Observability must include behavioral metrics to catch environment drift and silent degradation.
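A minimal sketch of the behavioral check described above, assuming each agent run is logged with three counters (steps, tool calls, retries). The `RunTelemetry` schema, the `summarize` and `drift_alerts` helpers, and the 25% threshold are all hypothetical names and values chosen for illustration, not part of any particular agent framework.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunTelemetry:
    """Behavioral counters for one agent run (hypothetical schema)."""
    steps: int       # total agent steps taken in the run
    tool_calls: int  # total tool invocations
    retries: int     # tool calls repeated after an error


def summarize(runs):
    """Aggregate behavioral metrics over a non-empty window of runs."""
    total_calls = sum(r.tool_calls for r in runs)
    return {
        "avg_steps": mean([r.steps for r in runs]),
        "avg_tool_calls": mean([r.tool_calls for r in runs]),
        # Retries per tool call; guard against a window with no tool calls.
        "retry_rate": sum(r.retries for r in runs) / max(1, total_calls),
    }


def drift_alerts(baseline, current, max_increase=0.25):
    """Return names of metrics that rose more than max_increase (relative)."""
    base, cur = summarize(baseline), summarize(current)
    alerts = []
    for name, base_value in base.items():
        if base_value == 0:
            # No meaningful ratio; flag any move away from zero.
            if cur[name] > 0:
                alerts.append(name)
        elif (cur[name] - base_value) / base_value > max_increase:
            alerts.append(name)
    return alerts
```

Comparing a recent window of runs against an earlier baseline window with `drift_alerts(baseline_runs, recent_runs)` would flag rising step counts or retry rates even when output-quality scores are unchanged. The threshold is an assumption and would need tuning against normal run-to-run variance.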
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T17:15:09.340153+00:00— report_created — created