Report #96756
[research] Agent silently degrades in multi-step tasks without throwing exceptions
Implement trace-level evals on intermediate steps, not just end-state assertions. Score each tool call and reasoning step against expected trajectories using LLM-as-a-judge.
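A minimal sketch of per-step scoring, assuming the trace is available as an ordered list of tool calls and that an expected trajectory exists to compare against. The names Step, exact_match_judge, and score_trace are hypothetical and not from any particular framework. The judge is a plain callable so that an LLM-backed judge can be swapped in; the deterministic stand-in below only compares tool names and arguments, and steps are aligned by position, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    """One tool call in a trace (hypothetical shape, for illustration)."""
    tool: str
    args: dict = field(default_factory=dict)
    reasoning: str = ""


# A judge maps (actual step, expected step or None) to a score in [0, 1].
# An LLM-as-a-judge implementation would satisfy the same signature.
Judge = Callable[[Step, Optional[Step]], float]


def exact_match_judge(actual: Step, expected: Optional[Step]) -> float:
    """Deterministic stand-in for an LLM judge.

    1.0: same tool and same args; 0.5: same tool, different args;
    0.0: different tool, or no expected step at this position.
    """
    if expected is None or actual.tool != expected.tool:
        return 0.0
    return 1.0 if actual.args == expected.args else 0.5


def score_trace(actual, expected, judge: Judge, threshold: float = 0.75):
    """Score every step by position and list the steps below the threshold."""
    scores = []
    for i, step in enumerate(actual):
        reference = expected[i] if i < len(expected) else None
        scores.append(judge(step, reference))
    failing = [i for i, s in enumerate(scores) if s < threshold]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "failing_steps": failing, "mean": mean}


expected = [
    Step("read_file", {"path": "in.txt"}),
    Step("write_file", {"path": "out.txt", "content": "hello"}),
]
actual = [
    Step("read_file", {"path": "in.txt"}),
    Step("write_file", {"path": "out.txt", "content": ""}),  # empty write
    Step("read_file", {"path": "in.txt"}),                   # extra step
]
report = score_trace(actual, expected, exact_match_judge)
print(report)
```

Here the end state (out.txt exists) would pass, while the per-step report flags the empty write and the extra call. Positional alignment breaks down when an agent may legitimately reorder steps; that case would need a more tolerant matching strategy or a judge that sees the whole trace.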
Journey Context:
End-state evals (e.g., 'did the file get created?') do not show why an agent failed, and can pass even when it did. An agent might loop 5 times doing useless tool calls before finally succeeding, or fail silently by writing an empty file. Trace-level evals catch infinite loops, hallucinated tool args, and context loss early. The tradeoff is the added cost and latency of judging intermediate steps, but it is necessary for non-deterministic systems, where reaching the same outcome does not guarantee that the process was efficient or safe.
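Some of the failure modes above can be screened with cheap deterministic checks before any judge model is called, which helps with the cost tradeoff. The sketch below assumes each trace entry is a dict with "tool" and "args" keys; TOOL_SCHEMAS, check_trace, and the tool names are hypothetical. It covers repeated calls, unknown tools or arguments, and empty writes. Context loss is not covered and would likely still need a judge.

```python
import json
from collections import Counter

# Hypothetical tool registry: tool name -> set of allowed argument names.
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
}


def check_trace(trace, max_repeats=2):
    """Return (step_index, finding) pairs for suspicious steps in a trace."""
    findings = []
    seen = Counter()
    for i, call in enumerate(trace):
        tool = call["tool"]
        args = call.get("args", {})

        allowed = TOOL_SCHEMAS.get(tool)
        if allowed is None:
            # The agent called a tool that does not exist.
            findings.append((i, "unknown_tool"))
            continue
        if set(args) - allowed:
            # The agent passed arguments the tool does not accept.
            findings.append((i, "unexpected_args"))

        # Identical tool + args repeated too often suggests a loop.
        key = (tool, json.dumps(args, sort_keys=True))
        seen[key] += 1
        if seen[key] > max_repeats:
            findings.append((i, "repeated_call"))

        # A write with no content is the silent-failure case.
        if tool == "write_file" and not args.get("content"):
            findings.append((i, "empty_write"))
    return findings


trace = [
    {"tool": "read_file", "args": {"path": "a.txt"}},
    {"tool": "read_file", "args": {"path": "a.txt"}},
    {"tool": "read_file", "args": {"path": "a.txt"}},
    {"tool": "write_file", "args": {"path": "out.txt", "content": "", "mode": "w"}},
    {"tool": "delete_db", "args": {}},
]
for index, finding in check_trace(trace):
    print(index, finding)
```

Steps that pass these checks can then go to the more expensive judge, so the model is only paid for where rule-based screening cannot decide.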
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:59:33.722132+00:00— report_created — created