Report #97332
[architecture] Cannot tell which agent caused a bad decision in production
Assign a correlation ID to every user goal and propagate it across every agent message, tool call, and event-log entry. Emit OpenTelemetry spans so the full causal chain can be reconstructed after the fact.
Journey Context:
When three agents and seven tool calls collaborate on one request, the logs look like a bag of unrelated LLM completions. Without a correlation ID you cannot establish the basic post-incident finding: 'Agent B made this bad tool call because Agent A returned that misleading summary.' The fix is cheap: generate one ID at the entry point and thread it through every boundary. Pair that with OpenTelemetry-style spans that capture start time, end time, parent-child relationships, and structured attributes. The result is more than a debugging convenience: it is the evidence you need to decide whether a failure is a prompt problem, a tool problem, or a routing problem. Without it you are guessing. With it you can attribute cost and errors to specific agents and iterate precisely.
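A minimal sketch of the idea, using only the Python standard library. It is not the OpenTelemetry API: `start_goal`, `span`, `causal_chain`, and the in-memory `SPANS` list are hypothetical names chosen for illustration, and a real deployment would export spans through an OpenTelemetry SDK rather than keep them in a list. The sketch assumes all agents run in one process, so `contextvars` can carry the ID. Across process or network boundaries the ID would have to travel in message headers or metadata.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional

# Hypothetical helpers for illustration only; not a real tracing library.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)
_current_span = contextvars.ContextVar("current_span", default=None)


@dataclass
class Span:
    name: str
    correlation_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, str] = field(default_factory=dict)
    start: float = 0.0
    end: Optional[float] = None


# Stand-in for an exporter: finished spans are appended here.
SPANS: List[Span] = []


def start_goal() -> str:
    """Generate one correlation ID at the entry point of a user goal."""
    cid = uuid.uuid4().hex
    _correlation_id.set(cid)
    _current_span.set(None)
    return cid


@contextmanager
def span(name: str, **attributes: str) -> Iterator[Span]:
    """Record one unit of work (agent step or tool call) under the current goal."""
    cid = _correlation_id.get()
    if cid is None:
        raise RuntimeError("call start_goal() before opening a span")
    parent = _current_span.get()
    current = Span(
        name=name,
        correlation_id=cid,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
        start=time.time(),
    )
    token = _current_span.set(current)
    try:
        yield current
    except Exception as exc:
        # Keep the failure on the span so it can be attributed later.
        current.attributes["error"] = repr(exc)
        raise
    finally:
        current.end = time.time()
        SPANS.append(current)
        _current_span.reset(token)


def causal_chain(span_id: str) -> List[str]:
    """Walk parent links from a finished span back to the root, root first."""
    by_id = {s.span_id: s for s in SPANS}
    chain: List[str] = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


if __name__ == "__main__":
    start_goal()
    with span("handle_goal"):
        with span("agent_a.summarize", agent="A"):
            pass
        with span("agent_b.plan", agent="B"):
            with span("tool.search", agent="B", tool="search") as leaf:
                pass
    print(causal_chain(leaf.span_id))
```

Every span recorded for one goal carries the same `correlation_id`, so filtering `SPANS` on that value yields the whole request, and the `agent` attribute gives a key for grouping errors or durations per agent. Parent links only show nesting. Inferring that Agent A's output influenced Agent B still relies on the shared ID plus the recorded start and end times.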
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:56:42.237489+00:00— report_created — created