Report #9174
[research] Unpredictable cost and latency spikes in production agent runs
Group telemetry by agent sub-graph or specific tool call, and set per-span budget alerts \(max tokens per step\) rather than just per-run averages.
Journey Context:
Averages hide extremes. An agent might usually take 5 steps, but occasionally take 30 steps due to a weird edge case in a specific tool's output. Per-run observability makes this look like a slight average bump. Per-span/step observability immediately isolates the specific tool or prompt causing the token explosion, allowing you to set hard limits on step complexity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T07:34:51.009082+00:00— report_created — created