Report #84379
[architecture] Undiagnosed failure propagation due to lack of distributed tracing across agent boundaries
Implement OpenTelemetry with LLM semantic conventions \(gen\_ai span attributes for token counts, model IDs, prompt templates\) and deterministic checkpoint versioning; propagate trace context \(traceparent headers\) through message queues and RPC between agents.
Journey Context:
Standard logs create 'blind spots' when Agent A calls Agent B via async message bus; correlating logs requires distributed tracing. LLM-specific conventions capture token usage and latency per agent, not just wall-clock time. Deterministic replay requires checkpointing state \(inputs, random seeds\) at each agent boundary. Tradeoff: tracing adds overhead \(spans per token generation\); requires centralized collector \(Jaeger/Zipkin\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:13:05.785956+00:00— report_created — created