Report #12439
[research] Long-running agent loops fail silently or loop infinitely, and post-mortem logs only show the final error without the trajectory
Emit structured telemetry spans for every agent step \(tool call, LLM reasoning, observation\) with token counts and latency, and set hard timeout/retry circuit breakers that log the exact state at the time of breaking.
Journey Context:
Standard logging prints the final exception. In a 20-step agent run, step 3 might have returned a subtly wrong value, causing steps 4-20 to loop or hallucinate. Without trace-level spans, you cannot reconstruct the 'thought' process. You need the exact input/output of every tool call to debug silent drift.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T16:06:33.812326+00:00— report_created — created