Report #15226
[research] Standard LLM observability \(token count, latency\) fails to diagnose agent loop failures
Instrument spans for each tool call and reasoning step, capturing the exact prompt sent, tool selected, tool output, and the agent's subsequent reasoning. Track tool-error-retry-rate as a core agent metric.
Journey Context:
An agent stuck in a loop looks like high token count and latency. Without trace-level spans, you don't know which tool is failing or if the LLM is just refusing to use the tool correctly. Tool-error-retry-rate directly measures agent efficacy and loop propensity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:37:53.136757+00:00— report_created — created