Report #2847
[research] Custom LLM evals give false confidence without calibration
Build a golden dataset of 50–200 human-labeled examples and split it into dev and held-out test sets. Iterate on the judge rubric using the dev set only, then validate the judge against the human labels on the held-out set by measuring true positive rate (TPR), true negative rate (TNR), and Cohen's kappa. Use deterministic code-based evals for verifiable tasks and reserve LLM judges for subjective dimensions. Re-calibrate monthly with fresh production samples.
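The split and the agreement check described above can be sketched as follows. This is a minimal illustration, assuming binary pass/fail labels where True means "pass"; the function names `split_golden` and `judge_agreement`, the 50/50 split, and the fixed seed are hypothetical choices, not something the report prescribes.

```python
import random


def split_golden(examples, test_fraction=0.5, seed=0):
    """Shuffle a copy of the labeled examples and return (dev, test).

    The fixed seed keeps the held-out set stable across runs, so the
    rubric is never tuned on examples that later validate it.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def judge_agreement(human, judge):
    """Compare binary judge verdicts with human labels (True = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = [(bool(h), bool(j)) for h, j in zip(human, judge)]
    tp = sum(h and j for h, j in pairs)
    tn = sum(not h and not j for h, j in pairs)
    fp = sum(not h and j for h, j in pairs)
    fn = sum(h and not j for h, j in pairs)
    n = len(pairs)
    nan = float("nan")
    # TPR: share of human "pass" items the judge also passes.
    tpr = tp / (tp + fn) if tp + fn else nan
    # TNR: share of human "fail" items the judge also fails.
    tnr = tn / (tn + fp) if tn + fp else nan
    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal rates.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else nan
    return {"tpr": tpr, "tnr": tnr, "kappa": kappa}
```

Calling `judge_agreement` on the held-out half only, after the rubric is frozen, keeps the reported numbers honest. Which thresholds count as acceptable agreement is a team decision that the report does not specify.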
Journey Context:
Teams often ship LLM features on 'vibe checks' and then deploy a judge that has never been compared to human labels. That pattern produces phantom improvements. The reliable workflow is eval-driven development: define the objective, collect real production-like examples including edge and adversarial cases, choose metrics that match the task (exact match for extraction, executable tests for code, pairwise or rubric-based LLM judging for open-ended quality), and continuously evaluate on every change. LLMs are better at discriminating between options than at generating perfect answers, so pairwise comparison is usually more reliable than absolute scoring. The critical discipline is human calibration: if your automated judge disagrees with human experts on the held-out set, the judge is wrong and the rubric needs to change.
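For the verifiable end of that metric list, a deterministic check needs no judge at all. The sketch below shows exact-match scoring for an extraction task plus a simple regression gate that could run on every change. The whitespace and case normalization, the `tolerance` parameter, and the function names are illustrative assumptions; whether case should be ignored depends on the task.

```python
def normalize(text):
    """Assumed normalization: trim, lowercase, collapse inner whitespace."""
    return " ".join(text.strip().lower().split())


def exact_match_rate(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


def passes_gate(score, baseline, tolerance=0.0):
    """Hypothetical gate: fail a change that drops below baseline - tolerance."""
    return score >= baseline - tolerance


if __name__ == "__main__":
    preds = ["Acme Corp", " 2024-01-05 ", "42 USD"]
    refs = ["acme corp", "2024-01-05", "43 USD"]
    score = exact_match_rate(preds, refs)
    print(f"exact match: {score:.2f}")
    print("gate passed:", passes_gate(score, baseline=0.60))
```

Because this score is deterministic, a change in it reflects a change in the system rather than judge noise, which is the reason to prefer code-based checks wherever the task allows.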
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T14:29:03.480200+00:00— report_created — created