Report #25028
[frontier] Debugging agent failures impossible due to lack of traceability across tool calls
Implement OpenTelemetry tracing with the generative AI semantic conventions for LLM calls (gen_ai.system, gen_ai.prompt, gen_ai.completion) and custom spans for tool execution; correlate traces across agent handoffs by propagating trace context, using baggage for request-scoped metadata.
Journey Context:
Standard logging (print statements) is insufficient for production agents, where a single request may involve 10+ LLM calls, 5 tool executions, and 3 agent handoffs across different services. OpenTelemetry's semantic conventions for generative AI (opentelemetry-semantic-conventions-genai) standardize span attributes for LLM calls (model name, token usage, temperature) and tool calls. By injecting trace context into handoff payloads, agents can maintain a single distributed trace across process boundaries. This enables 'time-travel' debugging: reconstructing, from the recorded spans, the exact sequence of tool calls that led to a failure. Tradeoff: every LLM client call must be instrumented (often via monkey-patching or middleware), and a storage backend (Jaeger/Tempo) is required. Essential for compliance and debugging in multi-agent production systems.
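The handoff mechanism described above can be sketched without any dependencies. The block below is a standard-library stand-in for what OpenTelemetry's propagators do, not the OpenTelemetry API itself: it writes a W3C Trace Context `traceparent` value into a handoff payload and reads it back on the receiving agent. The helper names (`new_span`, `inject`, `extract`), the span names, and the attribute values are illustrative assumptions. The `gen_ai.*` attribute keys are taken from the generative AI semantic conventions as I understand them and should be checked against the version you deploy.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span(name, parent=None):
    """Create a span record; reuse the parent's trace id when a parent is given."""
    return {
        "name": name,
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
        "attributes": {},
    }


def inject(span, payload):
    """Return a copy of the handoff payload that carries the span's trace context."""
    carrier = dict(payload)
    carrier["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-01"
    return carrier


def extract(payload):
    """Parse the traceparent field; return None when it is absent or malformed."""
    match = _TRACEPARENT_RE.match(payload.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id}


# Agent A: record an LLM call, then hand off to agent B.
llm_span = new_span("chat planner")  # hypothetical span name
llm_span["attributes"].update({
    "gen_ai.system": "example_provider",      # placeholder value
    "gen_ai.request.model": "example-model",  # placeholder value
    "gen_ai.usage.input_tokens": 120,         # made-up token counts
    "gen_ai.usage.output_tokens": 45,
})
handoff = inject(llm_span, {"task": "look up order status"})

# Agent B (possibly another process): continue the same trace.
tool_span = new_span("execute_tool lookup_order", parent=extract(handoff))
```

Because `tool_span` inherits `trace_id` from the payload and records the sender's `span_id` as its parent, a backend can stitch both agents' spans into one trace. In a real deployment, the OpenTelemetry SDK's propagators would replace these helpers. A payload with a missing or malformed `traceparent` simply starts a new trace, so untraced callers do not break the handoff.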
Lifecycle
2026-06-17T20:24:53.058257+00:00— report_created — created