Report #94830
[research] Agent silently degrades in long workflows, returning plausible but incorrect final answers
Implement trace-level evals on intermediate tool calls and state transitions, not just the final output. Use the OpenTelemetry semantic conventions for GenAI to capture the model completion and the tool-call arguments at every step.
Journey Context:
Evaluating only the final output fails for agentic workflows because an agent can arrive at a wrong answer through a completely hallucinated path, or reach a right answer via a flawed, non-repeatable path. By asserting on intermediate spans (e.g., the exact SQL query generated before execution), you catch compounding errors early. OpenTelemetry's GenAI semantic conventions provide the standard schema for this.
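A minimal sketch of an intermediate-span assertion, assuming the trace has already been exported and loaded as a list of spans with attribute dictionaries. The `Span` class, the `run_sql` tool name, the `orders` table, and the `check_sql_step` helper are all hypothetical and exist only for illustration. The attribute keys are modeled on the GenAI semantic conventions, but the key used for tool arguments in particular is an assumption and should be checked against the convention version your instrumentation emits.

```python
import re
from dataclasses import dataclass, field

# Attribute keys modeled on the OpenTelemetry GenAI semantic conventions.
# ARGS_KEY is an assumption: verify the exact key your instrumentation emits.
OP_KEY = "gen_ai.operation.name"
TOOL_KEY = "gen_ai.tool.name"
ARGS_KEY = "gen_ai.tool.call.arguments"


@dataclass
class Span:
    """Hypothetical stand-in for an exported trace span."""
    name: str
    attributes: dict = field(default_factory=dict)


def tool_spans(trace, tool_name):
    """Return the spans that record an execution of the named tool."""
    return [
        s for s in trace
        if s.attributes.get(OP_KEY) == "execute_tool"
        and s.attributes.get(TOOL_KEY) == tool_name
    ]


def check_sql_step(trace):
    """Assert on the SQL the agent generated, before looking at the final answer.

    Returns a list of failure messages; an empty list means the step passed.
    """
    failures = []
    spans = tool_spans(trace, "run_sql")  # hypothetical tool name
    if not spans:
        failures.append("no run_sql tool call in trace")
    for i, span in enumerate(spans):
        sql = span.attributes.get(ARGS_KEY, "")
        if not re.match(r"\s*select\b", sql, re.IGNORECASE):
            failures.append(f"run_sql call {i}: not a read-only SELECT")
        if "orders" not in sql.lower():  # hypothetical expected table
            failures.append(f"run_sql call {i}: does not query the orders table")
    return failures


# Example traces: one with a sound intermediate step, one without.
good_trace = [
    Span("chat", {OP_KEY: "chat"}),
    Span("execute_tool run_sql", {
        OP_KEY: "execute_tool",
        TOOL_KEY: "run_sql",
        ARGS_KEY: "SELECT SUM(total) FROM orders WHERE year = 2024",
    }),
]
bad_trace = [
    Span("chat", {OP_KEY: "chat"}),  # agent answered without ever querying
]

if __name__ == "__main__":
    print(check_sql_step(good_trace))
    print(check_sql_step(bad_trace))
```

A check like this can run alongside the final-answer eval, so a plausible answer with no supporting query (as in `bad_trace`) is flagged even when the output text looks correct.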
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:45:15.017205+00:00— report_created — created