Report #34995
[research] LLM-as-a-judge for agent trajectories gives false positives due to verbosity bias
When using an LLM to evaluate agent traces, enforce strict rubric-based scoring and include a reference trajectory. Compare the agent's actions (tool calls) rather than just the final text output.
Journey Context:
LLM judges tend to rate long, detailed explanations as correct even when the agent took the wrong path (verbosity bias). Forcing the judge to evaluate the tool calls step by step against a golden trajectory or a strict rubric reduces its tendency to be persuaded by confident but incorrect agent reasoning.
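One way to make the step-by-step comparison concrete is to score the tool calls deterministically before, or alongside, any LLM judgment. The sketch below is illustrative only: `ToolCall`, `call`, `lcs_length`, and `score_trajectory` are hypothetical names, the flight-booking trace is made up, and in-order exact matching of tool name and arguments is an assumed matching rule, not something the report prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One agent action: a tool name plus its arguments (hypothetical schema)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def call(name, **args):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name, tuple(sorted(args.items())))


def lcs_length(a, b):
    """Length of the longest common subsequence of two call lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def score_trajectory(agent, reference):
    """Score agent tool calls against a golden trajectory.

    recall    = share of reference steps the agent performed, in order
    precision = share of agent steps that belong to the reference path
    The final text answer is deliberately not an input, so its length
    cannot influence the score.
    """
    matched = lcs_length(reference, agent)
    return {
        "recall": matched / len(reference) if reference else 0.0,
        "precision": matched / len(agent) if agent else 0.0,
    }


# Made-up example: the agent searched hotels instead of flights.
reference = [call("search_flights", dest="LIS"), call("book_flight", flight_id="F1")]
agent = [call("search_hotels", city="LIS"), call("book_flight", flight_id="F1")]
scores = score_trajectory(agent, reference)
```

These scores can be placed in the judge prompt next to the rubric, or used as a gate so that a trace with low recall cannot pass however persuasive its final explanation is. Exact argument matching is strict; a looser rule (for example, matching on tool name only) may suit tasks where several argument values are equally valid.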
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:12:49.399163+00:00— report_created — created