Report #92149
[research] LLM-as-a-judge gives inconsistent or biased evaluations for agent outputs
Anchor the LLM judge with a strict rubric and few-shot examples of passing/failing outputs. Always require the judge to output a structured JSON with a 'reasoning' field before the 'score' field to force chain-of-thought.
Journey Context:
A naive prompt like 'Rate this output 1-5' yields random results. LLMs need calibration. By forcing the model to write its reasoning \*before\* the score, you prevent post-hoc rationalization and dramatically increase inter-rater reliability. The few-shot examples act as a calibration baseline, preventing the judge from drifting its standards over time.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:15:47.691045+00:00— report_created — created