Report #8609
[research] Agent success rate silently degrades over time without throwing errors
Run outcome-based statistical evals on a rolling window of production traces rather than relying on error-rate monitoring alone. Track per-task token usage, cost, and step count as leading indicators of degradation.
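A minimal sketch of the rolling-window idea, assuming each finished trace has already been scored pass/fail by some outcome eval. The `TraceOutcome` schema, the `RollingOutcomeMonitor` class, and all thresholds are hypothetical illustrations, not part of any specific tool. It uses a two-proportion z-score to compare the window's success rate with a fixed baseline, and simple mean ratios for step count and token usage.

```python
from collections import deque
from dataclasses import dataclass
from math import sqrt
from statistics import mean


@dataclass
class TraceOutcome:
    """One finished agent run, already scored by an outcome eval (hypothetical schema)."""
    success: bool
    steps: int
    tokens: int


class RollingOutcomeMonitor:
    """Compares a rolling window of recent outcomes against a fixed baseline."""

    def __init__(self, baseline_success_rate, baseline_n, baseline_mean_steps,
                 baseline_mean_tokens, window_size=200, z_threshold=2.0,
                 inflation_ratio=1.5):
        # All thresholds here are illustrative assumptions; tune them on real data.
        self.baseline_success_rate = baseline_success_rate
        self.baseline_n = baseline_n
        self.baseline_mean_steps = baseline_mean_steps
        self.baseline_mean_tokens = baseline_mean_tokens
        self.z_threshold = z_threshold
        self.inflation_ratio = inflation_ratio
        self.window = deque(maxlen=window_size)  # oldest outcomes fall off automatically

    def record(self, outcome):
        self.window.append(outcome)

    def success_z_score(self):
        """Two-proportion z-score; negative means the window is below baseline."""
        n = len(self.window)
        if n == 0:
            return 0.0
        p_win = sum(o.success for o in self.window) / n
        p_base = self.baseline_success_rate
        pooled = (p_win * n + p_base * self.baseline_n) / (n + self.baseline_n)
        se = sqrt(pooled * (1 - pooled) * (1 / n + 1 / self.baseline_n))
        return 0.0 if se == 0 else (p_win - p_base) / se

    def alerts(self):
        """Returns the names of the checks that currently fire."""
        fired = []
        if not self.window:
            return fired
        if self.success_z_score() < -self.z_threshold:
            fired.append("success_rate_drop")
        if mean(o.steps for o in self.window) > self.inflation_ratio * self.baseline_mean_steps:
            fired.append("step_count_inflation")
        if mean(o.tokens for o in self.window) > self.inflation_ratio * self.baseline_mean_tokens:
            fired.append("token_inflation")
        return fired
```

The step and token checks can fire while the success rate still looks healthy, which is the sense in which they act as leading indicators.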
Journey Context:
Agents often fail softly: they loop, take suboptimal paths, or yield incomplete results that still return 200 OK. Standard APM tools only catch exceptions, so these failures go unnoticed. You need LLM-as-a-judge or heuristic evals on the final state of the trace, sampled continuously, to catch drift before users complain.
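As a rough illustration of a heuristic eval on the final state of a trace, the sketch below flags two of the soft failures described above: repeated identical tool calls (looping) and an empty or missing final result. The trace layout (`steps`, `tool`, `args`, `final_output`), the function names, and the thresholds are all assumptions made for the example; an LLM-as-a-judge call could replace or supplement `evaluate_trace`. Sampling is done by hashing the trace ID so the same trace is always either in or out of the sample.

```python
import hashlib
import json
from collections import Counter


def should_sample(trace_id, rate):
    """Deterministically selects roughly `rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def evaluate_trace(trace, max_repeats=3, required_fields=("answer",)):
    """Heuristic outcome check for one finished trace.

    `trace` is assumed (hypothetically) to look like:
      {"steps": [{"tool": str, "args": dict}, ...], "final_output": dict}
    Returns (passed, reasons).
    """
    reasons = []

    # Looping: the same tool called with the same arguments many times.
    calls = Counter(
        (step["tool"], json.dumps(step.get("args"), sort_keys=True))
        for step in trace.get("steps", [])
    )
    if calls and max(calls.values()) >= max_repeats:
        reasons.append("looping")

    # Incomplete result: a required field is missing or empty,
    # even though the request itself may have returned 200 OK.
    final = trace.get("final_output") or {}
    if any(not final.get(field) for field in required_fields):
        reasons.append("incomplete_output")

    return (not reasons, reasons)
```

The pass/fail result from a check like this is what a rolling-window success-rate monitor would consume; error-free traces that fail the check are the silent degradations the report describes.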
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:05:17.399538+00:00— report_created — created