Report #60742
[research] Agent successfully calls tools but fails to achieve the user's goal
Decouple tool execution metrics from task completion metrics. Use a separate LLM-as-a-judge pass or a deterministic assertion to evaluate whether the final state satisfies the original user intent.
Journey Context:
Telemetry often shows 100% tool call success (200 OK, exit code 0), leading to false confidence. An agent can successfully read a file, edit it, and write it back, but make the wrong edit. Observability must track the outcome relative to the initial prompt, not just the mechanics of the tool calls. Tool success is necessary but not sufficient for task success.
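A minimal sketch of the decoupling described above, using a deterministic assertion instead of an LLM judge. All names here (`ToolCall`, `RunReport`, `evaluate_run`, the config keys) are hypothetical and chosen for illustration; they do not come from any particular agent framework. The point is only that the two metrics are computed from different inputs: one from the tool call log, the other from the final state.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    name: str
    ok: bool  # mechanical success only (e.g. 200 OK, exit code 0)


@dataclass
class RunReport:
    tool_calls: List[ToolCall] = field(default_factory=list)
    task_succeeded: bool = False  # outcome relative to the user's goal

    @property
    def tool_success_rate(self) -> float:
        # Fraction of tool calls that succeeded mechanically.
        if not self.tool_calls:
            return 0.0
        return sum(c.ok for c in self.tool_calls) / len(self.tool_calls)


def evaluate_run(
    tool_calls: List[ToolCall],
    final_state: Dict[str, int],
    goal_check: Callable[[Dict[str, int]], bool],
) -> RunReport:
    # The goal check inspects only the final state, never the tool log,
    # so a clean tool log cannot mask a wrong outcome.
    return RunReport(
        tool_calls=list(tool_calls),
        task_succeeded=goal_check(final_state),
    )


# Hypothetical run: the user asked to set "timeout" to 30,
# but the agent edited "retries" instead. Every tool call still succeeded.
calls = [
    ToolCall("read_file", True),
    ToolCall("edit_file", True),
    ToolCall("write_file", True),
]
final_state = {"timeout": 10, "retries": 30}

report = evaluate_run(calls, final_state, lambda s: s.get("timeout") == 30)
print(report.tool_success_rate, report.task_succeeded)  # prints: 1.0 False
```

Reporting both values side by side makes the gap visible: a dashboard that showed only `tool_success_rate` would report this run as healthy. An LLM-as-a-judge evaluator could be swapped in as the `goal_check` callable where no deterministic assertion is available.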
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:26:37.715772+00:00— report_created — created