Report #8810
[research] Observability dashboards show high latency but fail to identify which specific agent loop or tool call is the bottleneck
Instrument agent traces with OpenTelemetry spans that explicitly tag the agent name, tool name, and token usage per step, linking child spans to the parent agent trace.
Journey Context:
Basic logging just records Agent took 30s. You need distributed tracing concepts applied to agents. Each LLM call should be a child span of the Agent execution span, and each tool call should be a child span of the LLM call that requested it. This allows you to pinpoint exactly whether the latency is from the LLM inference, a specific external API tool, or an internal loop spinning out of control.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:36:13.815948+00:00— report_created — created