Report #35893
[frontier] How to trace tool calls and agent loops in distributed multi-agent systems without losing context across service boundaries
Implement the OpenTelemetry LLM semantic conventions, adding custom spans for agent steps, tool execution, and RAG retrieval points. Use baggage propagation to carry the conversation ID and agent identity across async boundaries.
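As a rough illustration of the baggage idea, the sketch below carries a conversation ID and agent identity across a queue boundary using only the Python standard library. It is not the OpenTelemetry SDK: the helper names (`set_baggage`, `inject`, `extract`), the keys `conversation.id` and `agent.id`, and the message shape are all assumptions made for this example. The header layout follows the W3C `baggage` format of comma-separated, percent-encoded `key=value` pairs.

```python
import contextvars
import queue
from urllib.parse import quote, unquote

# Hypothetical per-context store for baggage entries (not an OTel API).
_BAGGAGE = contextvars.ContextVar("baggage", default={})


def set_baggage(key, value):
    """Store a copy of the current baggage with one more entry."""
    updated = dict(_BAGGAGE.get())
    updated[key] = value
    _BAGGAGE.set(updated)


def inject(headers):
    """Serialize the current baggage into a W3C-style `baggage` header."""
    entries = _BAGGAGE.get()
    if entries:
        headers["baggage"] = ",".join(
            f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in entries.items()
        )
    return headers


def extract(headers):
    """Parse a `baggage` header and restore it in the current context."""
    parsed = {}
    for member in headers.get("baggage", "").split(","):
        if "=" in member:
            key, value = member.split("=", 1)
            # Drop optional ";property" suffixes, keeping only the value.
            parsed[unquote(key.strip())] = unquote(value.split(";", 1)[0].strip())
    _BAGGAGE.set(parsed)


def producer(q):
    """Agent side: attach identity to the outgoing message headers."""
    set_baggage("conversation.id", "conv-123")
    set_baggage("agent.id", "planner/agent A")
    q.put({"headers": inject({}), "body": "call tool X"})


def consumer(q):
    """Worker side: starts with no baggage and recovers it from headers."""
    message = q.get()
    extract(message["headers"])
    return dict(_BAGGAGE.get())


def demo():
    q = queue.Queue()
    # Separate empty contexts stand in for separate services.
    contextvars.Context().run(producer, q)
    return contextvars.Context().run(consumer, q)
```

Calling `demo()` returns the baggage as seen by the consumer. In a real deployment the same inject and extract steps would normally be handled by the tracing library's propagators rather than hand-written helpers.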
Journey Context:
Traditional logging loses causal relationships in async agent loops. Distributed tracing was designed for request-response flows, not stateful agent conversations. The OTEL LLM working group (2025) is standardizing span kinds like 'llm.agent.planning' and 'llm.tool.execution'. Without such spans, you cannot debug why agent A called tool X twice while agent B waited. Baggage propagation is critical because context IDs must survive message queues and temporal workflows. The tradeoff is overhead: every span adds latency, so sample aggressively in production (for example, 1% of traces) but capture 100% of error traces. Keeping every error trace requires tail-based sampling, because a head sampler makes its decision before it can know whether the trace will fail.
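The sampling tradeoff can be sketched as a tail-based decision made once all spans of a trace are available. This is a minimal sketch under stated assumptions, not a collector configuration: the `Span` type, the `keep_trace` function, and the use of the low 64 bits of the trace ID as a deterministic sampling key are all hypothetical choices for illustration.

```python
from dataclasses import dataclass

_MASK_64 = (1 << 64) - 1


@dataclass
class Span:
    trace_id: int          # trace ID as an integer (assumed 128-bit)
    name: str              # e.g. 'llm.tool.execution'
    is_error: bool = False


def keep_trace(spans, ratio=0.01):
    """Decide whether to keep a finished trace.

    Traces containing any error span are always kept. Other traces are
    kept for roughly `ratio` of trace IDs, using a deterministic threshold
    so every service reaches the same decision for the same trace.
    """
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    bound = int(ratio * (1 << 64))
    return (spans[0].trace_id & _MASK_64) < bound
```

Because the decision depends only on the trace ID and the error flag, agents that handle different parts of the same conversation do not need to coordinate to agree on which traces survive.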
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:43:14.047017+00:00— report_created — created