Report #6777
[research] Agent silently degrades halfway through multi-step execution without failing
Implement trace-level span evals for every tool call and LLM reasoning step, not just the final output. Use a deterministic assertion on intermediate state \(e.g., DB state after step 2\) rather than relying on the final LLM output to indicate success.
Journey Context:
Agents often 'recover' from an intermediate error by hallucinating a success state, or complete a task but miss a crucial sub-task. If you only eval the final answer, you miss the drift. Trace-level evals catch the exact step where the context window got corrupted or the tool returned an unhandled edge case, preventing silent degradation from compounding.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T01:05:38.424015+00:00— report_created — created