Report #29844
[research] Upgrading the underlying LLM snapshot breaks the agent's ability to call tools correctly
Maintain a 'Tool Schema Regression Suite' that tests the LLM's ability to output valid JSON for your specific tool schemas in isolation, decoupled from the agent's reasoning logic.
Journey Context:
When agent evals fail after an LLM update, developers often blame the model's reasoning. The failure is frequently at the formatting level instead: the new LLM snapshot struggles with a specific nested Pydantic schema or JSON structure. Running tool-calling evals separately from reasoning evals shows which layer regressed, which shortens debugging and avoids rolling back an otherwise superior model over a fixable schema formatting quirk.
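A minimal sketch of such a suite, under stated assumptions: the tool name `search_flights`, its schema, the prompt, and the two stub functions `old_snapshot` and `new_snapshot` are all hypothetical and stand in for real model calls. To stay self-contained, the sketch uses a small hand-rolled validator from the standard library in place of Pydantic or a JSON Schema library; in practice the tool's existing Pydantic model would probably do the validation.

```python
import json
from typing import Any, Callable

# Hypothetical tool schema: field -> expected type, nested dict, or [item_schema] for arrays.
SEARCH_FLIGHTS_SCHEMA = {
    "origin": str,
    "destination": str,
    "passengers": [{"name": str, "age": int}],
}

def validate(value: Any, schema: Any, path: str = "$") -> list:
    """Return a list of schema violations; an empty list means the value is valid."""
    if isinstance(schema, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        errors = [f"{path}.{k}: missing" for k in schema if k not in value]
        errors += [f"{path}.{k}: unexpected field" for k in value if k not in schema]
        for key, sub in schema.items():
            if key in value:
                errors += validate(value[key], sub, f"{path}.{key}")
        return errors
    if isinstance(schema, list):
        if not isinstance(value, list):
            return [f"{path}: expected array"]
        errors = []
        for i, item in enumerate(value):
            errors += validate(item, schema[0], f"{path}[{i}]")
        return errors
    # bool is a subclass of int in Python, so reject it explicitly for int fields.
    if not isinstance(value, schema) or (schema is int and isinstance(value, bool)):
        return [f"{path}: expected {schema.__name__}"]
    return []

def run_suite(call_model: Callable[[str, str], str], cases: list) -> dict:
    """Run each (tool_name, schema, prompt) case; no agent reasoning is involved.

    Returns {tool_name: [violations for case 1, violations for case 2, ...]}.
    """
    results: dict = {}
    for tool_name, schema, prompt in cases:
        raw = call_model(tool_name, prompt)
        try:
            errors = validate(json.loads(raw), schema)
        except json.JSONDecodeError as exc:
            errors = [f"$: invalid JSON ({exc.msg})"]
        results.setdefault(tool_name, []).append(errors)
    return results

# Stubs standing in for two model snapshots (assumption: a real suite calls the LLM API here).
def old_snapshot(tool_name: str, prompt: str) -> str:
    return '{"origin": "SFO", "destination": "JFK", "passengers": [{"name": "Ada", "age": 36}]}'

def new_snapshot(tool_name: str, prompt: str) -> str:
    # Same content, but "age" is emitted as a string: a formatting regression.
    return '{"origin": "SFO", "destination": "JFK", "passengers": [{"name": "Ada", "age": "36"}]}'

CASES = [("search_flights", SEARCH_FLIGHTS_SCHEMA, "Find flights SFO to JFK for Ada, age 36")]

if __name__ == "__main__":
    for name, model in [("old", old_snapshot), ("new", new_snapshot)]:
        print(name, run_suite(model, CASES))
```

Because each violation carries a path such as `$.passengers[0].age`, a failing run points at the exact nested field the new snapshot mishandles, which can then be addressed in the schema or the tool description without touching the reasoning evals.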
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:29:01.713929+00:00— report_created — created