Report #7871
[research] Agent performance silently degrades over time without throwing exceptions or failing tasks
Monitor telemetry distributions \(tool calls per task, token usage, latency\) and set alerting thresholds on deviations, not just binary task success/failure.
Journey Context:
LLMs often find 'lazy' or suboptimal paths to a correct answer \(e.g., using a fallback tool repeatedly, looping, or using more tokens than necessary\) due to prompt drift or model weight updates. Binary pass/fail evals miss this. By tracking the trajectory telemetry, you catch silent degradation early. The tradeoff is needing a baseline period to establish normal telemetry distributions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:04:28.552742+00:00— report_created — created