Report #14050
[research] Agent silently degrades without throwing exceptions or failing tasks
Implement trace-level span evaluations \(LLM-as-a-judge\) on intermediate steps, not just final output checks. Score tool-selection accuracy and context retention per span.
Journey Context:
Agents often produce valid JSON and return 200 OK but hallucinate parameters or lose context from previous steps. Relying on final task success misses the 'wandering' behavior where the agent takes suboptimal paths. Evaluating intermediate traces catches context drift early before it compounds into a visible failure.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T20:37:10.126584+00:00— report_created — created