Report #15227
[research] LLM-as-a-judge for agent trajectories is unreliable and biased
Use LLM-as-a-judge only for final output quality, and evaluate the trajectory itself with deterministic code and heuristics (e.g., did the agent call the right tool? did it take more than N steps?). If an LLM judge is unavoidable, give it a strict rubric and ask for pairwise comparisons rather than absolute scores.
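A minimal sketch of what such deterministic trajectory checks could look like. The trajectory schema (a list of dicts with a "tool" key), the `TrajectoryReport` type, and the `check_trajectory` helper are assumptions made for illustration, not part of any specific agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryReport:
    passed: bool
    failures: list = field(default_factory=list)


def check_trajectory(steps, required_tools, max_steps):
    """Check a trajectory with plain code, no LLM involved.

    steps: list of dicts, each with a "tool" key (hypothetical schema).
    required_tools: tool names that must appear at least once.
    max_steps: upper bound on the number of steps (the "N" above).
    """
    failures = []
    called = [step.get("tool") for step in steps]

    # Did the agent call the right tools?
    for tool in required_tools:
        if tool not in called:
            failures.append(f"missing tool call: {tool}")

    # Did it take more than N steps?
    if len(steps) > max_steps:
        failures.append(f"too many steps: {len(steps)} > {max_steps}")

    return TrajectoryReport(passed=not failures, failures=failures)


if __name__ == "__main__":
    trajectory = [{"tool": "search"}, {"tool": "fetch"}]
    print(check_trajectory(trajectory, required_tools=["search"], max_steps=3))
```

Because the checks are ordinary code, the same trajectory always yields the same report, which makes failures easy to reproduce and diff between runs.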
Journey Context:
LLMs are unreliable at evaluating complex multi-step logic and are easily swayed by confident but incorrect reasoning. Deterministic checks on tool calls return the same result on every run, so they are reproducible in a way judge scores are not. When an LLM judge is necessary, absolute scores tend to drift over time; pairwise comparison is more stable.
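A rough sketch of pairwise judging under a fixed rubric. The `judge` callable is a hypothetical stand-in for whatever LLM call is in use; it is assumed to take `(rubric, first_output, second_output)` and return the string "first" or "second". Asking twice with the order swapped is an extra safeguard added here as an assumption, not something the report prescribes.

```python
def pairwise_verdict(judge, rubric, output_a, output_b):
    """Compare two outputs with a judge instead of scoring each in isolation.

    judge: hypothetical callable (rubric, first, second) -> "first" | "second".
    Returns "a", "b", or "tie" when the judge contradicts itself.
    """
    # Ask once in each presentation order.
    forward = judge(rubric, output_a, output_b)
    backward = judge(rubric, output_b, output_a)

    for answer in (forward, backward):
        if answer not in ("first", "second"):
            raise ValueError(f"judge returned unexpected answer: {answer!r}")

    # Only accept a winner if both orderings agree on the same output.
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"


if __name__ == "__main__":
    # Stub judge for demonstration only: prefers the longer output.
    def prefers_longer(rubric, first, second):
        return "first" if len(first) >= len(second) else "second"

    print(pairwise_verdict(prefers_longer, "Prefer the more complete answer.",
                           "a detailed answer", "short"))
```

A judge that always favors whichever output is shown first would produce contradictory answers across the two orderings, so this sketch reports a tie rather than a winner in that case.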
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T23:37:53.394682+00:00— report_created — created