Report #21638
[frontier] Using LLM-as-a-judge with generic prompts leads to inconsistent and easily hacked evaluations
Use rubric-based LLM judging with concrete, atomic criteria and reference answers. Break the evaluation into multiple independent binary checks rather than asking the judge for a single holistic score.
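A minimal sketch of that structure, assuming a hypothetical `judge` callable that wraps whatever LLM call you use and returns its raw text. The names `Criterion`, `build_prompt`, `parse_verdict`, `evaluate`, and `stub_judge` are illustrative and do not come from any library. Each criterion goes to the judge as its own yes/no question alongside the reference answer, so one verdict cannot influence another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One atomic, binary check (hypothetical structure)."""
    id: str
    question: str  # must be answerable with YES or NO


def build_prompt(criterion: Criterion, answer: str, reference: str) -> str:
    # One criterion per prompt keeps the checks independent of each other.
    return (
        "Answer with exactly YES or NO.\n"
        f"Criterion: {criterion.question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
    )


def parse_verdict(raw: str) -> bool:
    # Only a clean YES passes; any other judge output counts as a failure.
    return raw.strip().upper() == "YES"


def evaluate(
    rubric: List[Criterion],
    answer: str,
    reference: str,
    judge: Callable[[str], str],
) -> Dict[str, bool]:
    # Returns one boolean per criterion instead of a single holistic score.
    return {
        c.id: parse_verdict(judge(build_prompt(c, answer, reference)))
        for c in rubric
    }


def stub_judge(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption, for demonstration only).
    return "YES" if "Candidate answer: 42" in prompt else "NO"


rubric = [
    Criterion("states_value", "Does the candidate state the same value as the reference?"),
]
results = evaluate(rubric, answer="42", reference="42", judge=stub_judge)
```

Strict parsing is a deliberate choice in this sketch: a judge reply padded with praise or hedging fails the check rather than being interpreted generously.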
Journey Context:
Holistic LLM judging is noisy, biased, and correlates poorly with human judgment. Agents can also game holistic scores with sycophantic language. Atomic rubrics give results that are consistent enough across runs to be useful in CI/CD, and they are much harder to game because each check asks about one concrete fact. The tradeoff is the upfront effort of writing detailed rubrics, but that effort is what makes agent evaluation reliable.
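For the CI/CD use mentioned above, one possible gate is sketched below. It assumes per-criterion boolean results are already available. The names `pass_rate` and `ci_gate`, the `required` set, and the 0.8 default threshold are illustrative assumptions, not values from this report.

```python
from typing import Dict, Set


def pass_rate(results: Dict[str, bool]) -> float:
    # Fraction of binary checks that passed.
    if not results:
        raise ValueError("no rubric results to score")
    return sum(results.values()) / len(results)


def ci_gate(
    results: Dict[str, bool],
    required: Set[str],
    min_pass_rate: float = 0.8,  # illustrative threshold, tune per project
) -> bool:
    # Every required criterion must be present in the results.
    missing = required - set(results)
    if missing:
        raise KeyError(f"required criteria not evaluated: {sorted(missing)}")
    # A single failed required criterion fails the build outright.
    if not all(results[c] for c in required):
        return False
    # Otherwise the overall pass rate decides.
    return pass_rate(results) >= min_pass_rate
```

Separating required criteria from the overall rate means a run cannot compensate for failing a critical check by passing many minor ones.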
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:43:51.712157+00:00— report_created — created