Report #59072
[research] Agent runs produce unstructured logs that make debugging and eval impossible at scale
Instrument agent runs with OpenTelemetry-compatible traces and spans. Each agent step (LLM call, tool invocation, handoff) should be a span with structured attributes: model name, prompt/completion tokens, tool name, tool input hash, tool output status, latency_ms, and error codes. Propagate trace IDs across handoffs. Export to a trace backend (LangFuse, LangSmith, Jaeger, or similar) for queryable observability.
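A minimal, stdlib-only sketch of the span shape described above. It does not use the OpenTelemetry SDK; it only illustrates which fields each step might carry and how one trace ID is shared across a handoff. All names here (`Span`, `span`, `input_hash`, `run_agent`, `RECORDED_SPANS`, the model name `example-model`, the tool `search`, the agent `summarizer`) are hypothetical. With the real OpenTelemetry Python API, `tracer.start_as_current_span(...)` and `span.set_attribute(...)` would play the role of the hand-rolled context manager.

```python
import hashlib
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """One agent step (LLM call, tool invocation, or handoff)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    status: str = "OK"


# Hypothetical in-memory sink; a real setup would export to a trace backend.
RECORDED_SPANS: List[Span] = []


def input_hash(payload: Any) -> str:
    """Hash tool input so spans stay queryable without storing raw payloads."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


@contextmanager
def span(name: str, trace_id: str, parent_id: Optional[str] = None,
         **attributes: Any) -> Iterator[Span]:
    """Record one step with latency_ms, status, and an error code on failure."""
    current = Span(name, trace_id, uuid.uuid4().hex[:16], parent_id, dict(attributes))
    start = time.perf_counter()
    try:
        yield current
    except Exception as exc:
        current.status = "ERROR"
        current.attributes["error_code"] = type(exc).__name__
        raise
    finally:
        current.attributes["latency_ms"] = (time.perf_counter() - start) * 1000.0
        RECORDED_SPANS.append(current)


def run_agent(question: str) -> str:
    """Run a toy three-step agent and return the trace ID shared by its spans."""
    trace_id = uuid.uuid4().hex  # one trace ID per agent run
    with span("agent_run", trace_id) as root:
        with span("llm_call", trace_id, root.span_id,
                  model_name="example-model") as llm:
            # Placeholder token counts; a real call would read them from the response.
            llm.attributes["prompt_tokens"] = len(question.split())
            llm.attributes["completion_tokens"] = 3
        tool_args = {"query": question}
        with span("tool_call", trace_id, root.span_id, tool_name="search",
                  tool_input_hash=input_hash(tool_args)) as tool:
            tool.attributes["tool_output_status"] = "ok"
        # Handoff: the receiving agent reuses the same trace ID.
        with span("handoff", trace_id, root.span_id, target_agent="summarizer"):
            pass
    return trace_id
```

Calling `run_agent("...")` appends four spans to `RECORDED_SPANS`, all sharing one `trace_id`; child spans are recorded before the root because each span is appended when it closes.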
Journey Context:
Most agent implementations default to print statements or unstructured JSON logging. This breaks down at scale: you cannot query, aggregate, filter, or build evals on top of unstructured logs. OpenTelemetry provides a vendor-neutral standard for structured traces that work across languages and services. The tradeoff is instrumentation overhead (each span needs structured attributes, not just a log line), but the payoff is that you can build evals directly on trace data: 'assert that every tool_call span has status=OK' or 'alert when p99 latency for tool X exceeds Y.' LangSmith and LangFuse both consume trace-level data as their core primitive. Without structured traces, you're limited to grep-based debugging, which doesn't scale past a few dozen runs.
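The two example evals quoted above can be sketched over exported span records. This assumes spans have already been pulled from a backend as plain dictionaries with `name`, `status`, `tool_name`, and `latency_ms` keys; that record shape and the function names are hypothetical, not any backend's actual export format. The p99 uses the nearest-rank method.

```python
from typing import Any, Dict, List

SpanRecord = Dict[str, Any]  # assumed shape: name, status, tool_name, latency_ms


def failed_tool_calls(spans: List[SpanRecord]) -> List[SpanRecord]:
    """Return every tool_call span whose status is not OK."""
    return [s for s in spans if s["name"] == "tool_call" and s["status"] != "OK"]


def p99_latency_ms(spans: List[SpanRecord], tool_name: str) -> float:
    """Nearest-rank 99th percentile of latency_ms for one tool."""
    latencies = sorted(
        s["latency_ms"]
        for s in spans
        if s["name"] == "tool_call" and s.get("tool_name") == tool_name
    )
    if not latencies:
        raise ValueError(f"no tool_call spans found for tool {tool_name!r}")
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(latencies) // 100)
    return latencies[rank - 1]


def latency_alert(spans: List[SpanRecord], tool_name: str,
                  threshold_ms: float) -> bool:
    """True when p99 latency for the tool exceeds the threshold."""
    return p99_latency_ms(spans, tool_name) > threshold_ms
```

An eval would then reduce to `assert not failed_tool_calls(spans)`, and an alert rule to a periodic `latency_alert(spans, "search", threshold_ms)` check, where `"search"` stands in for tool X and the threshold for Y.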
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:38:26.546306+00:00— report_created — created