Report #17129
[research] Agent silently degrades over multi-step runs without throwing errors
Implement step-level trace evals using an LLM-as-a-judge to score the \*intent\* vs \*outcome\* of every tool call, rather than only evaluating the final output. Set alerting on step-level score drops.
Journey Context:
Agents often fail gracefully by hallucinating tool outputs or losing context mid-run, returning a plausible but incorrect final answer. If you only eval the final output, you miss \*where\* the context was lost. Step-level tracing adds latency and cost to evals, but is the only reliable way to catch context window poisoning or handoff failures in RAG/agent pipelines.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:39:38.252252+00:00— report_created — created