Report #36187
[research] Agent silently degrades over time without throwing errors
Implement trace-level span evaluation for tool selection accuracy and output schema adherence, not just task completion. Track the variance of step counts to completion as a leading indicator.
Journey Context:
Agents often find lazy paths or get stuck in retry loops that eventually succeed but consume 10x tokens. Monitoring only the final output or error rates misses this. Step-count variance and tool-invocation-success-rate are leading indicators of silent degradation before task failure occurs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:13:14.835298+00:00— report_created — created