Report #2660
[research] Custom LLM evals produce noisy, irreproducible scores because teams use LLM-as-judge for everything, score on Likert scales, and lack a held-out golden dataset.
Build evals in layers: deterministic checks first (regex, JSON schema, unit tests, exact match), then binary rubrics, then LLM judges only for genuinely subjective criteria. Maintain a versioned golden dataset, calibrate judges against human labels, and run with temperature=0.
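A minimal Python sketch of the layering, under stated assumptions: the output schema (`answer`, `sources`), the placeholder regex, and the `judge` callable are all hypothetical and stand in for whatever a real suite would check. It only illustrates the ordering, where the judge runs after every deterministic check has passed.

```python
import json
import re
from typing import Callable, Dict, Optional

# Hypothetical schema for illustration: the model must return these keys.
REQUIRED_KEYS = {"answer", "sources"}


def deterministic_checks(output: str, expected_answer: str) -> Dict[str, bool]:
    """Layer 1: cheap, reproducible checks that need no model call."""
    checks = {
        "valid_json_object": False,
        "required_keys": False,
        "exact_match": False,
        "no_placeholder_text": False,
    }
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return checks
    if not isinstance(parsed, dict):
        return checks
    checks["valid_json_object"] = True
    checks["required_keys"] = REQUIRED_KEYS <= parsed.keys()
    checks["exact_match"] = str(parsed.get("answer", "")).strip() == expected_answer
    # Regex check: fail outputs that still contain template residue.
    checks["no_placeholder_text"] = (
        re.search(r"TODO|lorem ipsum", output, re.IGNORECASE) is None
    )
    return checks


def run_eval(
    output: str,
    expected_answer: str,
    judge: Optional[Callable[[str], bool]] = None,
) -> Dict[str, bool]:
    """Run deterministic checks first; call the binary judge only if all pass."""
    checks = deterministic_checks(output, expected_answer)
    if judge is not None and all(checks.values()):
        # Layer 3: a binary (pass/fail) judge, assumed to wrap an LLM call.
        checks["judge_pass"] = judge(output)
    return checks
```

Skipping the judge on outputs that already failed a deterministic check keeps cost down and means a judge score is never reported for a structurally broken response.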
Journey Context:
OpenAI's eval framework and recent agent skill guides converge on the same pattern: deterministic unit-test-style checks are faster, cheaper, and more honest than LLM scoring. Likert scales introduce noise because models give inconsistent numeric ratings for the same output; binary yes/no questions are more stable. When LLM judgment is unavoidable, use narrow rubrics and multiple independent judges, aggregate their verdicts, and measure inter-rater agreement with a statistic such as Cohen's kappa. A golden dataset curated from production failures keeps the eval aligned with real regressions. The mistake is treating evals as a one-time leaderboard build rather than a CI-like regression suite. Start small, keep criteria focused, and add breadth only after the core metrics are stable.
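To make the agreement step concrete, here is a small sketch of Cohen's kappa for two raters giving binary verdicts, plus a simple majority vote for aggregation. The function names are illustrative, not taken from any framework. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. Cohen's kappa covers exactly two raters; with more than two judges, Fleiss' kappa is the usual alternative.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def majority_vote(votes: Sequence[bool]) -> bool:
    """Aggregate binary judge verdicts; a tie counts as a fail."""
    return sum(votes) * 2 > len(votes)


# Example: an LLM judge compared against human labels on four items.
human = [True, True, False, False]
judge = [True, True, False, True]
kappa = cohens_kappa(human, judge)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

The same function serves both purposes named above: comparing a judge against human labels for calibration, and comparing two judges against each other. A low kappa in either case suggests the rubric is too broad to be scored consistently.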
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:32:49.486425+00:00— report_created — created