Report #2685
[research] Inability to distinguish between tool failure and agent reasoning failure in traces
Tag every error in the trace with a failure_type enum: tool_error (e.g., API 500), agent_logic_error (e.g., called wrong tool), or user_error (e.g., impossible request). Route tool_error to infra teams and agent_logic_error to prompt engineers.
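A minimal sketch of the tagging and routing idea, not a reference implementation. The three enum values come from the report; the `TraceError` record, the `ROUTES` table, and the queue names `"infra"` and `"prompt-engineering"` are hypothetical and only illustrate the shape. The report does not say where user_error should go, so this sketch leaves it unrouted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureType(Enum):
    TOOL_ERROR = "tool_error"                # e.g., API 500
    AGENT_LOGIC_ERROR = "agent_logic_error"  # e.g., called wrong tool
    USER_ERROR = "user_error"                # e.g., impossible request


@dataclass
class TraceError:
    """Hypothetical error record attached to one step of a trace."""
    step: int
    message: str
    failure_type: FailureType


# Hypothetical routing table; the queue names are placeholders.
ROUTES = {
    FailureType.TOOL_ERROR: "infra",
    FailureType.AGENT_LOGIC_ERROR: "prompt-engineering",
}


def route(error: TraceError) -> Optional[str]:
    """Return the owning queue, or None when no route is defined."""
    return ROUTES.get(error.failure_type)
```

With this shape, `route(TraceError(3, "API 500", FailureType.TOOL_ERROR))` returns the infra queue, while a user_error returns `None` and can be handled separately.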
Journey Context:
When an agent fails, the default is to blame the LLM. However, if the tool returned a 500 or malformed JSON, the agent's subsequent failure is a symptom, not the root cause. Without explicit tagging in the observability layer, prompt engineers waste time trying to fix 'reasoning' that is actually a backend uptime issue.
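One way to catch the two tool-side symptoms named above (a 5xx status or a malformed JSON body) before the agent's next step is blamed. This is a sketch under assumptions: `classify_tool_response` is a hypothetical helper, and it assumes every tool is expected to return a JSON body, which may not hold for all tools.

```python
import json
from typing import Optional


def classify_tool_response(status_code: int, body: str) -> Optional[str]:
    """Return "tool_error" if the tool response itself is broken, else None.

    Assumption: tools are expected to answer with a JSON body.
    """
    # Server-side failure such as an API 500.
    if status_code >= 500:
        return "tool_error"
    # A body that does not parse as JSON is also the tool's fault.
    try:
        json.loads(body)
    except json.JSONDecodeError:
        return "tool_error"
    return None
```

Tagging at the point where the tool response is received keeps the root cause attached to the tool call rather than to whatever the agent does afterwards.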
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:35:49.539865+00:00— report_created — created