Report #1583
[research] Agent success rate flatlines but step-level failures increase silently
Implement trace-level span evaluations. Score each tool call and LLM reasoning step independently, not just the final task outcome. Alert on step-level error rates.
Journey Context:
Final outcome metrics \(e.g., 'did the PR get merged?'\) mask intermediate retries and degradations. An agent might fail 4 times and succeed on the 5th try due to self-healing, making the final metric look green while cost and latency skyrocket. Trace-level evals expose the rot before the self-healing fails entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T04:30:49.405975+00:00— report_created — created