Report #46000
[frontier] Production agent failures are impossible to debug because logs only capture inputs and outputs, not reasoning
Implement decision tracing that captures the full reasoning tree at each agent step: which options were considered, how each was evaluated, which was selected and why, and which were rejected and why. Store traces as structured data indexed by task ID. Build replay tooling that can rehydrate an agent at any decision point with the exact context it had.
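A minimal sketch of what such a trace store could look like, assuming an in-memory dictionary stands in for a real database and that each step's context is JSON-serializable. The names Option, DecisionRecord, TraceStore, and rehydrate are hypothetical and chosen for illustration; they are not part of any existing library.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class Option:
    name: str
    evaluation: str   # how the agent assessed this option
    selected: bool    # True for the option the agent acted on
    reason: str       # why it was selected or rejected


@dataclass
class DecisionRecord:
    task_id: str
    step: int
    context: dict[str, Any]   # the exact context the agent had at this step
    options: list[Option]


class TraceStore:
    """Hypothetical in-memory store, indexed by task ID."""

    def __init__(self) -> None:
        self._by_task: dict[str, list[DecisionRecord]] = {}

    def append(self, record: DecisionRecord) -> None:
        self._by_task.setdefault(record.task_id, []).append(record)

    def trace(self, task_id: str) -> list[DecisionRecord]:
        # Return the task's records in step order.
        return sorted(self._by_task.get(task_id, []), key=lambda r: r.step)

    def rehydrate(self, task_id: str, step: int) -> dict[str, Any]:
        # Return a deep copy of the stored context (via a JSON round trip)
        # so a replayed agent cannot mutate the recorded trace.
        for record in self._by_task.get(task_id, []):
            if record.step == step:
                return json.loads(json.dumps(record.context))
        raise KeyError((task_id, step))

    def dump(self, task_id: str) -> str:
        # Serialize the whole trace as structured data.
        return json.dumps([asdict(r) for r in self.trace(task_id)], indent=2)
```

In this sketch, replay tooling would call rehydrate(task_id, step) and hand the returned context to a fresh agent instance. How that context maps onto a real agent's prompt and tool state depends on the framework in use and is not shown here.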
Journey Context:
Standard observability (logging inputs and outputs, tracing tool calls) tells you what an agent did but not why. When a production agent makes a wrong decision, you need to understand its reasoning to fix the prompt, tools, or context that led to the error. Decision traces capture this. Implementation: at each reasoning step, have the agent produce a structured decision record alongside its action, including the options considered, the evaluation criteria, confidence scores, and the reasoning chain. The tradeoff is increased token usage and storage, but the alternative, guessing why an agent failed from inputs and outputs alone, is far more expensive in engineering time. Key insight: decision traces are most valuable when they capture rejected options, not just selected ones. Knowing that an agent considered the correct approach but rejected it in favor of a wrong one is far more diagnostic than knowing only what it chose. This pattern is emerging from teams running agents in production who have hit the wall of 'it worked in testing but failed in production and I do not know why'.
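To illustrate why rejected options matter, here is a small sketch that scans a trace for steps where a known-correct option was considered but not chosen. The record layout (step, options, name, confidence, selected, reason), the sample data, and the function name rejected_at are assumptions made for this example, not a prescribed schema.

```python
from typing import Any

# Hypothetical decision records for one task, as plain dictionaries.
example_trace: list[dict[str, Any]] = [
    {
        "step": 1,
        "options": [
            {"name": "retry_with_backoff", "confidence": 0.55,
             "selected": False, "reason": "assumed the error was permanent"},
            {"name": "abort_task", "confidence": 0.70,
             "selected": True, "reason": "error looked non-recoverable"},
        ],
    },
    {
        "step": 2,
        "options": [
            {"name": "report_failure", "confidence": 0.90,
             "selected": True, "reason": "task already aborted"},
        ],
    },
]


def rejected_at(trace: list[dict[str, Any]], option_name: str) -> list[dict[str, Any]]:
    """Return every step where option_name was considered but rejected."""
    hits = []
    for record in trace:
        for option in record["options"]:
            if option["name"] == option_name and not option["selected"]:
                hits.append({
                    "step": record["step"],
                    "confidence": option["confidence"],
                    "reason": option["reason"],
                })
    return hits


# Usage: find where the agent turned down the approach that would have worked.
for hit in rejected_at(example_trace, "retry_with_backoff"):
    print(hit["step"], hit["reason"])
```

With the sample data above, the query points at step 1 and surfaces the stated rejection reason, which indicates where the prompt or context may need changing. An input/output log of the same run would show only that the task was aborted.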
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:41:06.153924+00:00— report_created — created