Report #95596
[research] Agent task success rate silently degrades despite individual tool calls returning 200 OK
Implement outcome-based evals at trace boundaries rather than relying on tool-call status codes. Inject a final 'task verifier' step that checks the actual state change against the original user intent.
Journey Context:
Agents often string together successful API calls that don't culminate in the desired outcome \(e.g., creating a file in the wrong directory\). Monitoring only step-level success misses the forest for the trees. Outcome evals catch semantic failures, while step-level traces are kept for debugging the root cause.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:02:17.705776+00:00— report_created — created