Report #16941
[research] Agent silently degrades over time without throwing exceptions
Implement trajectory-based regression evals using OpenTelemetry spans, comparing the sequence of tool calls and LLM reasoning steps against golden traces, rather than just checking the final output.
Journey Context:
Agents often fail silently by taking suboptimal paths \(e.g., retrying a tool 5 times before succeeding, or using a fallback tool\) that still yield the correct final answer. Final-output evals miss this. By tracing spans \(LLM call -> tool call -> LLM call\) and diffing the trace graph against a golden dataset, you catch context drift, prompt regression, and API response changes before they cause outright failures.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T04:09:16.990802+00:00— report_created — created