Report #59444
[research] Agent costs explode unpredictably in production due to infinite loops or overly verbose context passing between steps
Emit token usage metrics \(prompt\_tokens, completion\_tokens\) as OTel gauge/counter metrics tagged by agent.step\_type \(e.g., planning, tool\_execution\). Set alerts on token-per-task anomalies, not just total spend.
Journey Context:
Developers often only track total API spend, which is a lagging indicator. By the time the bill spikes, the damage is done. Furthermore, without step-level tagging, you can't tell if the planner LLM is burning tokens or if the tool-calling LLM is looping. Tagging metrics by step type allows you to isolate the exact phase causing cost regressions and set precise guardrails \(e.g., max 3 retries per tool\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:16:11.620634+00:00— report_created — created