Report #86603
[research] LLM-as-a-judge incorrectly validates agent tool calls because it cannot verify exact API schemas
Separate the eval into two steps: 1) programmatic validation of the tool call arguments against the tool's JSON Schema, and 2) LLM-as-a-judge only for semantic correctness, i.e. whether the tool choice makes sense given the conversation history.
Journey Context:
LLM judges are unreliable at checking whether user_id is an integer or a string, or whether a required field is missing. They will often say a tool call looks correct even though the API would reject it with a 400 error. Offloading structural validation to deterministic JSON Schema checks leaves the LLM judge to reason only about intent, which it is actually good at.
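A minimal sketch of the two-step eval, under stated assumptions: the `GET_USER_SCHEMA` tool schema, the `evaluate_tool_call` helper, and the `judge` callable are all hypothetical names made up for illustration. The validator below covers only a small subset of JSON Schema (`type`, `required`, `properties`, `additionalProperties: false`) so the example runs on the standard library alone. A real pipeline would more likely use a full validator such as the `jsonschema` package.

```python
import json
from typing import Any, Callable

# Hypothetical tool schema, for illustration only (not from the report).
GET_USER_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "include_orders": {"type": "boolean"},
    },
    "required": ["user_id"],
    "additionalProperties": False,
}

_JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "boolean": bool,
    "integer": int,
    "number": (int, float),
}


def _matches_type(value: Any, type_name: str) -> bool:
    if type_name == "null":
        return value is None
    # bool subclasses int in Python, but a JSON boolean is not a number.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, _JSON_TYPES[type_name])


def schema_errors(args: Any, schema: dict, path: str = "$") -> list[str]:
    """Step 1: deterministic structural check (subset of JSON Schema)."""
    expected = schema.get("type")
    if expected and not _matches_type(args, expected):
        return [f"{path}: expected {expected}, got {type(args).__name__}"]
    errors: list[str] = []
    if isinstance(args, dict):
        props = schema.get("properties", {})
        for name in schema.get("required", []):
            if name not in args:
                errors.append(f"{path}: missing required field '{name}'")
        for name, value in args.items():
            if name in props:
                errors.extend(schema_errors(value, props[name], f"{path}.{name}"))
            elif schema.get("additionalProperties") is False:
                errors.append(f"{path}: unexpected field '{name}'")
    return errors


def evaluate_tool_call(
    raw_args: str,
    schema: dict,
    history: list,
    judge: Callable[[list, dict], Any],
) -> dict:
    """Run the schema check first; call the LLM judge only if it passes."""
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        return {"schema_ok": False, "errors": [f"invalid JSON: {exc.msg}"], "judge": None}
    errors = schema_errors(args, schema)
    if errors:
        # Structurally invalid: fail without spending a judge call.
        return {"schema_ok": False, "errors": errors, "judge": None}
    # Step 2: the judge (an assumed callable wrapping an LLM) sees only
    # structurally valid calls and scores intent against the history.
    return {"schema_ok": True, "errors": [], "judge": judge(history, args)}
```

With this split, a call like `{"user_id": "42"}` fails at step 1 with a precise error message and never reaches the judge, so the judge prompt can drop any instructions about field types or required arguments.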
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:57:17.122272+00:00— report_created — created