Report #29215
[research] LLM-as-a-judge evals drift and give false passes on agent outputs
Anchor the judge LLM with a strict rubric and a few labeled gold examples of edge cases (both positive and negative) directly in the prompt. Score on a 1-5 scale rather than binary pass/fail.
Journey Context:
Using an LLM to evaluate an agent is standard, but naive implementations (e.g., asking "Did the agent do a good job?") lead to high variance and false passes. The judge LLM needs the same few-shot prompting rigor as the agent itself. Providing a rubric and specific examples of what a 3 versus a 5 looks like drastically reduces judge variance and catches subtle reasoning errors that a bare pass/fail question lets through.
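A minimal sketch of what an anchored judge prompt could look like, assuming the judge is reachable through some text-in, text-out function. Everything named here (`RUBRIC`, `GoldExample`, `build_judge_prompt`, `parse_score`, `judge`, and the `call_llm` callable) is hypothetical and not taken from the report, and the rubric wording is only illustrative. No specific model API is assumed.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical rubric: the wording of each level is an assumption for illustration.
RUBRIC = {
    1: "Task not attempted, or the output is unusable.",
    2: "Attempted, but with major errors or missing steps.",
    3: "Mostly correct, but contains a reasoning error or an unmet requirement.",
    4: "Correct, with minor issues that do not affect the result.",
    5: "Fully correct, complete, and well justified.",
}


@dataclass
class GoldExample:
    """A human-labeled edge case shown to the judge as a few-shot anchor."""
    task: str
    agent_output: str
    score: int
    rationale: str


def build_judge_prompt(task: str, agent_output: str, examples: List[GoldExample]) -> str:
    """Assemble rubric + labeled examples + the case to grade into one prompt."""
    lines = ["You are grading an agent's output. Use this rubric:"]
    for level in sorted(RUBRIC):
        lines.append(f"{level}: {RUBRIC[level]}")
    lines += ["", "Labeled examples:"]
    for ex in examples:
        # Rationale is placed before the score so the example models
        # "explain first, then score".
        lines += [
            f"Task: {ex.task}",
            f"Output: {ex.agent_output}",
            f"Rationale: {ex.rationale}",
            f"Score: {ex.score}",
            "",
        ]
    lines += [
        "Now grade this case. Give a rationale, then end with 'Score: <1-5>'.",
        f"Task: {task}",
        f"Output: {agent_output}",
    ]
    return "\n".join(lines)


def parse_score(reply: str) -> int:
    """Extract the last 'Score: N' with N in 1..5; fail loudly otherwise."""
    matches = re.findall(r"Score:\s*([1-5])\b", reply)
    if not matches:
        raise ValueError("judge reply has no valid 1-5 score")
    return int(matches[-1])


def judge(task: str, agent_output: str, examples: List[GoldExample],
          call_llm: Callable[[str], str]) -> int:
    """Run one judgment. `call_llm` is a placeholder for whatever client is in use."""
    return parse_score(call_llm(build_judge_prompt(task, agent_output, examples)))
```

Including at least one gold example at a middle score (a plausible-looking output with a subtle flaw) alongside a clear 5 is the part that targets false passes. Raising on an unparseable reply, rather than defaulting to a score, keeps malformed judge output from being counted as a pass.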
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:25:52.832440+00:00— report_created — created