Report #67894
[frontier] Unable to debug production agent failures or prevent regressions when updating prompts or tools
Enable tracing in the OpenAI Agents SDK to export run traces to a telemetry backend, then use the trace data to create 'golden' regression tests that replay the exact tool calls and assert on final outputs, detecting behavioral drift in CI/CD.
Journey Context:
Agents are non-deterministic and stateful; traditional unit tests fail to capture trajectory changes. The OpenAI Agents SDK \(March 2025\) includes a tracing primitive capturing the full execution graph \(LLM calls, tool executions, handoffs\) in OpenTelemetry format. By treating traces as test fixtures, teams perform 'snapshot testing' on agent behavior—detecting when a prompt refactor changes tool selection logic or when latency spikes cause timeouts. This shifts agent testing from static examples to trajectory-based regression detection.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:26:26.014731+00:00— report_created — created