Report #77459
[research] LLM-as-a-judge overrates long, convoluted agent trajectories that ultimately fail
Decouple trajectory evaluation from outcome evaluation. Use strict deterministic checks for the outcome, and use LLM-as-a-judge only for step-level efficiency, with a rubric that explicitly penalizes step count.
Journey Context:
LLM judges suffer from verbosity bias. On a holistic LLM eval, a failing agent that writes extensive reasoning and tries many tools tends to score higher than a concise agent that fails quickly. Splitting the eval and penalizing trajectory length aligns the judge with actual system efficiency and avoids rewarding loop-prone agents.
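A minimal Python sketch of the split described above, under stated assumptions. None of these names come from the report: `evaluate`, `length_penalty`, `EvalResult`, `RUBRIC`, the `step_budget` of 5, and the linear penalty are all hypothetical. The judge is passed in as a plain callable so no particular LLM API is assumed. The report asks for the step-count penalty in the rubric itself; this sketch also applies it in code, as one possible safeguard in case the judge ignores the rubric.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical rubric text for the judge; the wording is an assumption.
RUBRIC = (
    "Rate step-level efficiency from 0.0 to 1.0. Ignore whether the task "
    "succeeded. Penalize redundant, repeated, or unnecessary steps."
)


@dataclass
class EvalResult:
    outcome_passed: bool  # from the deterministic check only
    efficiency: float     # judge score after the step-count penalty, 0.0 to 1.0


def length_penalty(step_count: int, step_budget: int, rate: float = 0.1) -> float:
    """Linear penalty for each step beyond the budget, capped at 1.0 (assumed form)."""
    extra_steps = max(0, step_count - step_budget)
    return min(1.0, rate * extra_steps)


def evaluate(
    trajectory: Sequence[str],
    final_output: str,
    outcome_check: Callable[[str], bool],
    judge_step_quality: Callable[[str, Sequence[str]], float],
    step_budget: int = 5,
) -> EvalResult:
    # Outcome: strict and deterministic, never seen by the judge.
    passed = outcome_check(final_output)
    # Trajectory: the judge scores efficiency only, clamped to [0, 1].
    raw = min(1.0, max(0.0, judge_step_quality(RUBRIC, trajectory)))
    # Explicit step-count penalty applied on top of the judge score.
    penalty = length_penalty(len(trajectory), step_budget)
    return EvalResult(outcome_passed=passed, efficiency=raw * (1.0 - penalty))
```

As an illustration, take a stub judge that favors verbosity and returns 0.8 for a 12-step failing run and 0.6 for a 2-step failing run. Both runs fail the outcome check, and the penalized efficiency comes out at roughly 0.24 for the long run versus 0.6 for the short one, so the ranking no longer favors the longer trajectory.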
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:36:39.327950+00:00— report_created — created