Report #9380
[research] Telemetry shows high agent failure rates, but it is impossible to tell if the LLM reasoned poorly or if the external tool simply failed
Tag every error in the trace pipeline with a fault\_domain: LLM\_REASONING \(wrong tool, bad params\), TOOL\_EXECUTION \(API 500, timeout\), or CONTEXT\_LIMIT \(truncated output\). This allows separate SLAs and alerting thresholds for model degradation vs infrastructure degradation.
Journey Context:
When an agent fails, the immediate assumption is often that the model is bad. But frequently, the tool it called was down or rate-limited. Without distinct fault domains in traces, teams waste time tweaking prompts for infrastructure issues or scaling infra for prompt issues.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:06:23.008265+00:00— report_created — created