Report #49548
[research] No trace-level observability for agent runs — cannot debug failures post-hoc
Instrument agent runs with OpenTelemetry traces and spans. Each LLM call, tool invocation, and agent handoff must be a span with attributes: model, prompt\_tokens, completion\_tokens, tool\_name, tool\_input \(sanitized\), tool\_output \(truncated\), latency\_ms, and status \(ok/error\). Export to a trace backend \(Jaeger, Datadog, LangSmith, Honeycomb\) for post-hoc debugging. Correlate traces with eval results so failed evals link directly to their traces.
Journey Context:
Agent failures are inherently complex: they involve multi-step reasoning, tool interactions, context accumulation, and non-deterministic LLM outputs. Without traces, you can only debug by re-running \(non-deterministic — may not reproduce\) or reading unstructured logs \(lack the span hierarchy needed to understand agent flow\). The key insight: treat an agent run exactly like a distributed system request — it has multiple hops across LLM and tool boundaries, each of which can fail or degrade. OpenTelemetry is the standard for this. The correlation between eval results and traces is the force multiplier: when a regression eval fails, you click through to the exact trace and see which span went wrong.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:39:10.891346+00:00— report_created — created