Report #30941
[research] Agent silently fails or degrades after LLM API update or minor prompt tweak
Implement strict JSON schema validation on every LLM tool call output and log the raw response alongside the parsed attempt. Track tool call parse error rates as a primary eval metric.
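A minimal sketch of the validation-plus-logging step, using only the Python standard library. The `get_weather` tool, its `GET_WEATHER_SCHEMA`, and the names `ParseResult` and `validate_tool_args` are hypothetical and exist only for illustration; a real deployment would more likely validate against a full JSON Schema with a dedicated library.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger("tool_call_validation")

# Hypothetical schema for an illustrative `get_weather` tool:
# field name -> required Python type. Every field is required.
GET_WEATHER_SCHEMA = {"city": str, "days": int}


@dataclass
class ParseResult:
    ok: bool
    args: Optional[dict]
    error: Optional[str]


def _check_fields(parsed: Any, schema: dict) -> ParseResult:
    """Reject anything that is not exactly the expected object shape."""
    if not isinstance(parsed, dict):
        return ParseResult(False, None, "top-level value is not an object")
    missing = sorted(set(schema) - set(parsed))
    if missing:
        return ParseResult(False, None, f"missing required fields: {missing}")
    extra = sorted(set(parsed) - set(schema))
    if extra:
        return ParseResult(False, None, f"unexpected fields: {extra}")
    for name, expected in schema.items():
        value = parsed[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and expected is not bool:
            return ParseResult(False, None, f"field {name!r} is not {expected.__name__}")
        if not isinstance(value, expected):
            return ParseResult(False, None, f"field {name!r} is not {expected.__name__}")
    return ParseResult(True, parsed, None)


def validate_tool_args(raw: str, schema: dict) -> ParseResult:
    """Strictly parse raw tool-call arguments and log raw text with the outcome."""
    try:
        parsed = json.loads(raw)  # strict: no comments, no single quotes
    except json.JSONDecodeError as exc:
        result = ParseResult(False, None, f"invalid JSON: {exc.msg}")
    else:
        result = _check_fields(parsed, schema)
    # Log the raw response next to the parsed attempt on every call.
    level = logging.INFO if result.ok else logging.WARNING
    logger.log(level, "tool_call_parse ok=%s error=%s raw=%r parsed=%r",
               result.ok, result.error, raw, result.args)
    return result
```

Keeping the raw string in the log line is the point of the design: when a model update starts emitting single-quoted keys or trailing comments, the log shows exactly what changed rather than only that parsing failed.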
Journey Context:
Agents often fail silently because a model update changes how the model formats JSON arguments (e.g., adding comments, changing quotes, or omitting required fields). If you evaluate only the final outcome, you won't catch the 5 retries it took to get a valid JSON parse, which cost extra tokens and add latency. Tracking parse success rate at the span level surfaces the regression early, before it affects users.
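To illustrate why outcome-only evals miss this, here is a small sketch of a per-tool parse error counter that records every attempt, including retries that eventually succeed. The class names `ParseStats` and `ParseMetrics`, the `search` tool name, and the 5% alert threshold are assumptions made for the example, not part of any particular tracing or eval framework.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ParseStats:
    attempts: int = 0
    failures: int = 0

    @property
    def error_rate(self) -> float:
        # Fraction of parse attempts that failed; 0.0 when nothing was recorded.
        return self.failures / self.attempts if self.attempts else 0.0


@dataclass
class ParseMetrics:
    # One ParseStats per tool name, created on first use.
    by_tool: dict = field(default_factory=lambda: defaultdict(ParseStats))

    def record(self, tool_name: str, parsed_ok: bool) -> None:
        """Record one parse attempt (one span), whether or not it succeeded."""
        stats = self.by_tool[tool_name]
        stats.attempts += 1
        if not parsed_ok:
            stats.failures += 1

    def regressions(self, threshold: float) -> list:
        """Return tool names whose parse error rate exceeds the threshold."""
        return sorted(
            name for name, stats in self.by_tool.items()
            if stats.error_rate > threshold
        )


metrics = ParseMetrics()
# A task that "succeeded" overall, but only after five failed parses.
for ok in [False, False, False, False, False, True]:
    metrics.record("search", ok)
metrics.record("get_weather", True)

flagged = metrics.regressions(threshold=0.05)  # hypothetical 5% alert threshold
```

In this run the final outcome for `search` looks like a success, yet five of its six parse attempts failed, so it is the only tool flagged.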
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:19:27.068194+00:00— report_created — created