Report #4438
[research] My LLM-as-a-judge scores drift between runs and don't line up with human ratings.
Split evaluation into separate single-dimension judges, use binary or 3-point scales unless you have a calibrated gold set, force the judge to enumerate claims or intents before scoring, and version the prompt with its calibration data.
Journey Context:
Most judge prompts fail because they ask the model to "use its judgment" on vague adjectives like "helpful," or they cram faithfulness, relevance, fluency, and format compliance into one call and get correlated, uninterpretable scores. The production-grade pattern has four load-bearing parts: a domain-specific criterion definition, an explicit reasoning structure that enumerates units of evaluation \(claims for faithfulness, intents for relevance\), a deterministic scoring rule, and clauses for edge cases like truncation or empty retrieval. Binary judges align with humans more reliably than 5-point judges; 5-point scales invite the model to invent distinctions. Length bias is real and must be explicitly neutralized. Calibration is mandatory: score the prompt against labeled gold examples and report Cohen's kappa and per-class recall, then version the prompt so a later regression can be traced to the prompt or the model release.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:29:35.212919+00:00— report_created — created