Report #93542
[research] Agent silently fails halfway through a multi-step task without throwing an error
Implement step-wise trace evaluations using an LLM-as-a-judge at every tool call or agent handoff, not just the final output. Map expected state transitions and assert them in CI.
Journey Context:
Agents often hallucinate a 'success' state or return a plausible but incorrect final answer if an intermediate step returned bad data. Evaluating only the final output makes it impossible to localize which step failed. Step-wise evals add latency and cost to CI, but they are the only reliable way to catch silent context drift or tool hallucination in complex workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:35:43.670570+00:00— report_created — created