Report #4543
[research] Agent silently degrades in multi-step runs but final outcome masks the failure
Implement trajectory/trace-level evaluations instead of relying solely on outcome-based evals. Score intermediate tool calls and reasoning steps against a gold-standard path to catch drift.
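One way to score a run against a gold-standard path is sketched below. This is a minimal illustration, not a reference implementation: `ToolCall`, `lcs_length`, `trajectory_score`, and the tool names are all hypothetical, and it compares tool names only (arguments and reasoning text are ignored). It uses a longest-common-subsequence match so that extra steps lower precision without being treated as a hard failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in an agent trace (hypothetical shape, for illustration)."""
    name: str


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trajectory_score(actual, gold):
    """Compare the tool-call order of a run against a gold path.

    coverage:  fraction of gold steps that appear, in order, in the run.
    precision: fraction of the run's steps that belong to the gold path.
    """
    if not gold:
        raise ValueError("gold path must contain at least one step")
    matched = lcs_length([c.name for c in actual], [c.name for c in gold])
    return {
        "coverage": matched / len(gold),
        "precision": matched / len(actual) if actual else 0.0,
    }


# Assumed example: the run reaches the right outcome but detours first.
gold = [ToolCall("read_file"), ToolCall("parse"), ToolCall("write_file")]
actual = [
    ToolCall("read_file"),
    ToolCall("search_web"),   # step not on the gold path
    ToolCall("read_file"),    # redundant repeat
    ToolCall("parse"),
    ToolCall("write_file"),
]
scores = trajectory_score(actual, gold)
print(scores)
```

In this example an outcome-only check would pass because the file is written, while the precision value shows that two of the five steps were off the gold path.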
Journey Context:
Outcome-based evals (e.g., 'did the file get created?') fail to catch agents that take inefficient, hallucinated, or brittle paths which sometimes happen to yield the right result. Trajectory evals check the process itself, ensuring the agent isn't relying on luck or hidden side-effects. Tradeoff: Trajectory evals are harder to author and can over-constrain the agent, so restrict them to critical handoffs and tool-usage validation rather than creative generation steps.
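To keep trajectory checks from over-constraining the agent, one option is to assert only on a few critical checkpoints and leave every other step free. The sketch below assumes a trace is a list of dicts with `tool` and `args` keys. That shape, the function `unmet_checkpoints`, and the tool names are hypothetical and chosen only for illustration.

```python
def unmet_checkpoints(trace, checkpoints):
    """Return labels of checkpoints that do not occur, in order, in the trace.

    trace:       list of steps, each a dict like {"tool": str, "args": dict}.
    checkpoints: list of (label, predicate) pairs; predicate takes one step.
    Steps that match no checkpoint are ignored, so creative or exploratory
    steps between checkpoints are not penalized.
    """
    pos = 0
    unmet = []
    for label, predicate in checkpoints:
        # Search only from the step after the previously matched checkpoint.
        idx = next(
            (i for i in range(pos, len(trace)) if predicate(trace[i])),
            None,
        )
        if idx is None:
            unmet.append(label)
        else:
            pos = idx + 1
    return unmet


# Assumed example: the file is written, but never validated beforehand.
trace = [
    {"tool": "draft_text", "args": {}},
    {"tool": "write_file", "args": {"path": "out.txt"}},
]
checkpoints = [
    ("validate before write", lambda s: s["tool"] == "validate_schema"),
    ("write output", lambda s: s["tool"] == "write_file"),
]
failures = unmet_checkpoints(trace, checkpoints)
print(failures)
```

An empty result means every critical handoff happened in the expected order. Here the write checkpoint is met, so an outcome eval would pass, but the missing validation step is reported.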
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:40:38.064954+00:00— report_created — created