Report #51725
[frontier] Cannot debug why an agent produced a bad output: no visibility into intermediate decisions, tool calls, or context state at each turn
Instrument every agent step with structured spans compatible with OpenTelemetry: log the input context summary, model call parameters, tool selection, tool output, and context window state at each turn. Tag spans with agent ID, task type, model version, and token counts. Export to a trace backend for visualization. Correlate agent traces with application traces for end-to-end debugging.
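A minimal sketch of the span structure described above, using only the Python standard library so that it runs without a tracing SDK installed. The `Recorder` and `Span` classes, the `run_turn` helper, and all attribute names (`agent.id`, `tool.name`, `context.message_count`, and so on) are hypothetical and chosen for illustration; they are not OpenTelemetry's API or its semantic conventions. The model call and the tool are hard-coded placeholders, and the token count is a crude word-count proxy.

```python
import contextlib
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """Stand-in for a tracing span: a named, timed unit of work with attributes."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0


class Recorder:
    """Collects finished spans in memory; a real setup would export them."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextlib.contextmanager
    def span(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        s = Span(
            name=name,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            attributes=dict(attributes or {}),
            start_ns=time.time_ns(),
        )
        self._stack.append(s)
        try:
            yield s
        except Exception as exc:
            s.attributes["error"] = repr(exc)  # record the failure, then re-raise
            raise
        finally:
            s.end_ns = time.time_ns()
            self._stack.pop()
            self.finished.append(s)


def run_turn(rec: Recorder, turn: int, context: List[str]) -> str:
    """One agent turn: a parent span for the turn, a child span for the tool call."""
    turn_attrs = {
        "agent.id": "agent-1",                 # placeholder values
        "agent.task_type": "demo",
        "agent.turn": turn,
        "model.version": "placeholder-model",
        "context.message_count": len(context),
        "context.summary": context[-1][:80] if context else "",
    }
    with rec.span("agent.turn", turn_attrs) as turn_span:
        # Stands in for the model's tool selection.
        tool_name, tool_input = "lookup", "order 42"
        with rec.span("tool.call", {"tool.name": tool_name, "tool.input": tool_input}) as tool_span:
            tool_output = f"result for {tool_input}"  # stands in for the real tool
            tool_span.attributes["tool.output"] = tool_output
        # Word count as a rough proxy; a real system would use the tokenizer's count.
        turn_span.attributes["tokens.input"] = sum(len(m.split()) for m in context)
        return tool_output
```

With the OpenTelemetry Python API, the same shape would likely be expressed with `tracer.start_as_current_span(...)` and `span.set_attribute(...)`, with an exporter replacing the in-memory list. Sharing one trace ID between the turn span and its tool-call children is what lets a trace backend render the turn as a tree.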
Journey Context:
Agent systems are opaque: you see the input and the output but not the reasoning chain. When an agent produces a wrong answer, teams resort to reading raw API logs or adding print statements, then try to reconstruct what context the model actually saw at turn 7 of a 15-turn workflow. The emerging pattern is to treat agent execution like distributed tracing. Each agent turn is a span with structured attributes. Tool calls are child spans that carry their inputs and outputs. Context window snapshots at each turn show exactly what the model saw. OpenTelemetry is the natural standard because agent systems are already distributed: they call external APIs, use tool servers, and span multiple model invocations. The investment in instrumentation pays for itself the first time you debug a production failure by clicking through a trace instead of grepping logs. The tradeoff is that instrumentation adds code complexity and storage costs for trace data. Mitigate by sampling traces in production (log 100% of failures, 10% of successes) and compressing context snapshots.
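The sampling and compression mitigation could be sketched as follows, again with the standard library only. The function names, the hash-based bucketing, and the choice of JSON plus zlib are assumptions made for illustration, not a prescribed implementation. Because the keep/drop decision depends on whether the trace failed, it has to be made after the trace completes, which means spans must be buffered until then.

```python
import hashlib
import json
import zlib
from typing import List

SUCCESS_SAMPLE_RATE = 0.10  # from the text: keep 10% of successful traces


def should_keep(trace_id: str, failed: bool, success_rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Keep every failed trace; keep a deterministic fraction of successful ones."""
    if failed:
        return True
    # Hash the trace ID into a 64-bit bucket so the same trace always gets
    # the same decision, regardless of which process evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(success_rate * 2**64)


def compress_snapshot(messages: List[dict]) -> bytes:
    """Serialize a context window snapshot compactly and compress it."""
    raw = json.dumps(messages, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)


def decompress_snapshot(blob: bytes) -> List[dict]:
    """Inverse of compress_snapshot, for inspecting a stored trace."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Hashing the trace ID rather than calling a random number generator keeps the decision consistent across services, so a trace is either kept whole or dropped whole. Context snapshots tend to compress well because consecutive turns repeat most of the earlier conversation.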
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:18:58.147375+00:00— report_created — created