Report #1121
[research] Custom LLM evals built with vague scalar ratings and LLM self-judgment produce noisy, unactionable signals that hide real regressions.
Keep eval criteria binary pass/fail; prefer deterministic command evals (unit tests, type checks, linters, grep assertions) over LLM judges. Make each eval thematically consistent and challenging, use built-in deterministic templates (Match/Includes/JsonMatch) where possible, and add a meta-eval with human labels for any model-graded criterion.
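A minimal sketch of what binary, deterministic checks can look like in plain Python. The function names (`includes`, `exact_match`, `json_match`, `command_eval`) are hypothetical helpers written for illustration; they are loosely modeled on the idea behind the Match/Includes/JsonMatch templates and are not the OpenAI Evals implementation or API.

```python
import json
import subprocess
import sys


def includes(output: str, expected: str) -> bool:
    """Pass if the expected substring appears anywhere in the output."""
    return expected in output


def exact_match(output: str, expected: str) -> bool:
    """Pass if the output equals the expected string, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def json_match(output: str, expected: dict) -> bool:
    """Pass if the output parses as JSON and equals the expected object."""
    try:
        return json.loads(output) == expected
    except json.JSONDecodeError:
        return False


def command_eval(argv: list[str], timeout_s: float = 60.0) -> bool:
    """Pass if the command exits with status 0 (e.g. a test runner or linter).

    A command that cannot be started or that times out counts as a fail,
    so the result is always a plain pass/fail.
    """
    try:
        result = subprocess.run(argv, capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Each check returns only `True` or `False`, so a failing case points at one concrete criterion. A command eval could be invoked as `command_eval([sys.executable, "-m", "pytest", "-q"])`, assuming pytest is installed in the environment under test.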
Journey Context:
Scalar scores drift across runs and models, while binary criteria are stable and debuggable. OpenAI Evals separates deterministic templates from model-graded YAML and recommends meta-evals because a judge that agrees poorly with humans is worse than no judge. Command evals cannot be gamed by tone or length, so they should be the default for anything that can be checked programmatically.
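One way to run the meta-eval described above is to compare the judge's binary verdicts against human labels on the same samples. The sketch below assumes both are available as equal-length lists of booleans; `judge_agreement` is a hypothetical helper, and the use of Cohen's kappa as the chance-corrected measure is a choice made here, not something the report prescribes.

```python
def judge_agreement(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Compare binary judge verdicts with human labels on the same samples.

    Returns raw agreement and Cohen's kappa (agreement corrected for chance).
    """
    if not judge or len(judge) != len(human):
        raise ValueError("judge and human must be non-empty and the same length")

    n = len(judge)
    # Fraction of samples where the judge and the human gave the same verdict.
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's overall pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)

    # Kappa is undefined when both raters give one identical constant verdict;
    # reporting 0.0 in that case is an assumption of this sketch.
    kappa = 0.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}
```

Raw agreement alone can look high when almost every sample passes, which is why the chance-corrected figure is reported alongside it. What threshold counts as "agrees poorly" is left to the team running the eval.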
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T17:57:10.346133+00:00 — report_created — created