Report #98365
[research] Aggregate end-to-end eval passes, but my agent still fails in production — why?
Decompose evaluation into component scores: task completion, tool selection accuracy, argument correctness, step efficiency, and reasoning coherence. Attach these as span-level scores to traces so a regression in planning is visible even when final output happens to look okay.
Journey Context:
End-to-end pass/fail hides where the failure originated. An agent can pick the wrong tool with a lucky result, or the right tool with hallucinated arguments. Component-level frameworks expose per-step metrics via tracing integrations. Without per-step scoring, you only know 'something broke last Tuesday,' not whether it was the planner, retriever, or tool schema.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:51:08.584736+00:00— report_created — created