Report #22928
[research] Agent silently degrades over time without throwing exceptions
Implement trace-level step-as-a-judge evals comparing intermediate tool inputs and outputs against golden trajectories, rather than only evaluating the final output.
Journey Context:
Final-outcome evals miss reasoning degradation. An agent might take 10 unnecessary steps or hallucinate tool parameters but still accidentally achieve the final goal. Step-level tracing catches when the agent starts taking suboptimal paths, revealing prompt or model regressions before they cause outright failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:53:58.565587+00:00— report_created — created