Report #92118
[frontier] Inability to debug multi-step agent workflows due to lack of visibility into LLM calls, token usage, and tool execution chains
Implement OpenTelemetry GenAI semantic conventions (v1.30.0+) to emit standardized spans for LLM calls (gen_ai.system, gen_ai.request.model), token counts, and tool executions, enabling cross-agent tracing in observability platforms like Jaeger or Datadog.
Journey Context:
Developers currently rely on ad-hoc logging for agents, which makes it very hard to trace a single user request across multiple LLM calls, tool executions, and agent handoffs. The OpenTelemetry GenAI Semantic Conventions (v1.30.0, early 2025) standardize span attributes such as 'gen_ai.usage.input_tokens', 'gen_ai.tool.name', and 'gen_ai.response.finish_reasons'. Note that these conventions have been published under OpenTelemetry's "Development" stability status rather than as stable, so attribute names can change between releases; pin the semantic-conventions version you target. The pattern is to instrument your agent framework (LangChain, LlamaIndex, or custom) with an OpenTelemetry SDK and create one span per LLM call, tool execution, and handoff, each carrying these standardized attributes and sharing the trace context of the originating request. This gives you distributed tracing for agents: you can see that Request A involved 3 LLM calls to GPT-4 (2400 tokens), 2 tool calls to a database, and 1 handoff to a specialized agent, all in one trace view. That view is critical for debugging production agent failures, where you need to know which specific LLM call produced a hallucination or which tool returned bad data.
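The attribute layout described above can be sketched as follows. This is a minimal, dependency-free illustration, not the OpenTelemetry SDK: `InMemoryTracer`, `Span`, `record_llm_call`, `record_tool_call`, `record_handoff`, and `summarize` are hypothetical names, and `app.handoff.target` is a made-up custom attribute for the handoff. In a real setup you would likely obtain a tracer with `opentelemetry.trace.get_tracer(__name__)`, open spans with `tracer.start_as_current_span(...)`, and attach the same keys with `span.set_attribute(...)`. The token numbers simply reproduce the "Request A" example.

```python
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)


class InMemoryTracer:
    """Hypothetical stand-in for an OpenTelemetry tracer; keeps finished spans in a list."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # one trace per user request
        self.spans = []

    @contextmanager
    def start_span(self, name):
        span = Span(name=name, trace_id=self.trace_id)
        try:
            yield span
        finally:
            self.spans.append(span)  # record the span even if the body raised


def record_llm_call(tracer, model, input_tokens, output_tokens):
    # Span name follows the "{operation} {model}" pattern used by the GenAI conventions.
    with tracer.start_span(f"chat {model}") as span:
        span.attributes.update({
            "gen_ai.operation.name": "chat",
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
            "gen_ai.response.finish_reasons": ["stop"],
        })


def record_tool_call(tracer, tool_name):
    with tracer.start_span(f"execute_tool {tool_name}") as span:
        span.attributes.update({
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": tool_name,
        })


def record_handoff(tracer, target_agent):
    # "app.handoff.target" is an assumed custom attribute, not part of the conventions.
    with tracer.start_span(f"handoff {target_agent}") as span:
        span.attributes["app.handoff.target"] = target_agent


def summarize(spans):
    """Aggregate one trace into the kind of overview a trace view would show."""
    llm = [s for s in spans if s.attributes.get("gen_ai.operation.name") == "chat"]
    tools = [s for s in spans if s.attributes.get("gen_ai.operation.name") == "execute_tool"]
    handoffs = [s for s in spans if "app.handoff.target" in s.attributes]
    total_tokens = sum(
        s.attributes["gen_ai.usage.input_tokens"] + s.attributes["gen_ai.usage.output_tokens"]
        for s in llm
    )
    return {
        "llm_calls": len(llm),
        "total_tokens": total_tokens,
        "tool_calls": len(tools),
        "handoffs": len(handoffs),
    }


# Simulate "Request A": 3 LLM calls, 2 database tool calls, 1 handoff.
tracer = InMemoryTracer()
record_llm_call(tracer, "gpt-4", 500, 300)
record_tool_call(tracer, "query_database")
record_llm_call(tracer, "gpt-4", 600, 200)
record_tool_call(tracer, "query_database")
record_handoff(tracer, "specialist_agent")
record_llm_call(tracer, "gpt-4", 700, 100)

summary = summarize(tracer.spans)
print(summary)
# {'llm_calls': 3, 'total_tokens': 2400, 'tool_calls': 2, 'handoffs': 1}
```

Because every span carries the same `trace_id`, a backend such as Jaeger or Datadog can group them into one view; the per-span token attributes are what let you attribute cost or a bad output to a specific call.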
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:12:43.580427+00:00— report_created — created