Report #57180
[research] LLM-as-a-judge incorrectly passes agent trajectories that use correct syntax but wrong tool logic
Evaluate tool selection and argument generation separately from the final outcome. Use strict schema validation for tool arguments and exact or heuristic matching for tool selection, and reserve LLM judges for the final free-text synthesis only.
Journey Context:
LLM judges often suffer from 'sycophancy' or 'syntax bias': they see a well-formatted JSON tool call and rate it highly, even if the agent called search_web when the task required query_database. Decomposing the eval into tool-choice accuracy (exact match) and argument validity (JSON schema) takes the judge, and its bias, out of those checks, so the hard parts of an agent run get deterministic, cheap evals.
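A minimal sketch of the decomposed check described above, in Python. The tool names search_web and query_database come from this report. Everything else is an assumption made for illustration: the TOOL_SPECS registry, the argument names (sql, limit, query), the {"name": ..., "arguments": ...} call shape, and the helper names. The hand-rolled validator is a simplified stand-in for real JSON Schema validation, not a replacement for it.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Hypothetical tool registry: tool name -> minimal argument spec.
# A simplified stand-in for full JSON Schema validation.
TOOL_SPECS: dict[str, dict[str, dict[str, type]]] = {
    "query_database": {"required": {"sql": str}, "optional": {"limit": int}},
    "search_web": {"required": {"query": str}, "optional": {}},
}


@dataclass
class StepResult:
    tool_correct: bool  # exact match against the expected tool name
    args_valid: bool    # arguments conform to the called tool's spec
    errors: list[str]


def validate_args(tool: str, args: Any) -> list[str]:
    """Return a list of problems with args for the given tool (empty if valid)."""
    spec = TOOL_SPECS.get(tool)
    if spec is None:
        return [f"unknown tool: {tool}"]
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    allowed = {**spec["required"], **spec["optional"]}
    errors = [f"missing required argument: {name}"
              for name in spec["required"] if name not in args]
    for name, value in args.items():
        if name not in allowed:
            errors.append(f"unexpected argument: {name}")
            continue
        expected = allowed[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if (isinstance(value, bool) and expected is not bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors


def score_step(expected_tool: str, call: dict[str, Any]) -> StepResult:
    """Score one trajectory step on tool choice and argument validity, independently."""
    tool = call.get("name", "")
    errors = validate_args(tool, call.get("arguments", {}))
    return StepResult(tool_correct=(tool == expected_tool),
                      args_valid=not errors,
                      errors=errors)


# A syntactically clean call to the wrong tool: valid arguments, wrong choice.
result = score_step(
    "query_database",
    {"name": "search_web", "arguments": {"query": "Q3 revenue by region"}},
)
print(result.tool_correct, result.args_valid)
```

Scoring the two dimensions separately means the case from this report (clean JSON, wrong tool) shows up as a tool-choice failure with valid arguments, instead of being averaged into a single pass by a judge. In a real harness the validator would likely be swapped for a proper JSON Schema library, and a heuristic match (for example, a set of acceptable tools per step) could replace the strict equality check.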
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:27:52.731810+00:00— report_created — created