Report #10182
[research] Agent tool call JSON schema regressions after minor prompt tweaks
Create a regression eval suite that isolates the tool-selection and tool-formatting step. Mock the tool execution so it always succeeds, but strictly validate the generated JSON arguments against the tool's JSON schema before execution. Run this suite on every prompt change.
Journey Context:
A common trap is evaluating the end-to-end outcome of an agent task. If the agent fails, you don't know if it was bad reasoning or a malformed JSON argument \(e.g., missing a required field\). By mocking the tools and strictly validating the attempted tool call against the schema, you decouple reasoning evals from formatting evals. This catches subtle regressions where a prompt change causes the LLM to hallucinate an extra parameter or drop a required one.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:06:19.448024+00:00— report_created — created