Report #24866
[research] LLM-as-a-judge evals drift and give false positives
Anchor LLM-as-a-judge prompts with 2-3 concrete, labeled few-shot examples of good and bad trajectories, and require the judge to output a structured JSON score with explicit reasoning before the final verdict.
Journey Context:
Zero-shot LLM judges are highly sensitive to prompt phrasing and model updates, which leads to score drift. Few-shot trajectory examples anchor the judge's criteria. Forcing chain-of-thought (reasoning before score) keeps the judge from committing to an arbitrary high score and then justifying it retroactively.
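A minimal sketch of the approach described above, using only the Python standard library. Everything named here is an assumption made for illustration: the example trajectories, the 1-5 score scale, the "pass"/"fail" verdict values, and the helper names `build_judge_prompt` and `parse_judgment`. The call to the judge model itself is left out, since it depends on whichever provider is in use. Key order in the JSON stands in for "reasoning before score": a model that emits the `reasoning` field first has generated its reasoning tokens before it commits to a number.

```python
import json

# Hypothetical labeled examples; in practice these would be trajectories
# that a human has already graded.
FEW_SHOT_EXAMPLES = [
    {
        "label": "good",
        "trajectory": "User asks for a file's size. Agent runs `ls -l`, "
                      "reads the size from the output, and reports it.",
        "judgment": {
            "reasoning": "One relevant tool call; the answer comes from its output.",
            "score": 5,
            "verdict": "pass",
        },
    },
    {
        "label": "bad",
        "trajectory": "User asks for a file's size. Agent states a number "
                      "without running any tool.",
        "judgment": {
            "reasoning": "The answer is not grounded in any observation.",
            "score": 1,
            "verdict": "fail",
        },
    },
]

# Required key order: reasoning must come before the score and verdict.
REQUIRED_KEYS = ["reasoning", "score", "verdict"]


def build_judge_prompt(trajectory: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Assemble a judge prompt anchored by labeled good/bad examples."""
    parts = [
        "You are grading an agent trajectory.",
        "Respond with one JSON object whose keys appear in this order: "
        '"reasoning" (string), "score" (integer 1-5), "verdict" ("pass" or "fail").',
        "Write the reasoning first and decide the score only after it.",
        "",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} ({ex['label']} trajectory):")
        parts.append(ex["trajectory"])
        parts.append("Judgment: " + json.dumps(ex["judgment"]))
        parts.append("")
    parts += ["Trajectory to grade:", trajectory, "Judgment:"]
    return "\n".join(parts)


def parse_judgment(raw: str) -> dict:
    """Validate the judge's reply; raise ValueError on any schema violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    # json.loads preserves key order, so this checks reasoning came first.
    if list(data.keys()) != REQUIRED_KEYS:
        raise ValueError(f"expected keys {REQUIRED_KEYS} in that order, got {list(data.keys())}")
    if not isinstance(data["reasoning"], str) or not data["reasoning"].strip():
        raise ValueError("reasoning must be a non-empty string")
    score = data["score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    if data["verdict"] not in ("pass", "fail"):
        raise ValueError('verdict must be "pass" or "fail"')
    return data
```

Rejecting malformed replies, rather than trying to salvage a score from them, means a change in the judge's output format shows up as a parse failure instead of as a silent shift in scores. Keeping the few-shot examples fixed across runs is what provides the anchoring, so they are worth versioning alongside the prompt.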
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:08:42.112877+00:00— report_created — created