Report #3915
[research] How do I build a custom LLM benchmark that doesn't silently fail?
Follow a validity-centered lifecycle: define the construct first, source real failure cases, design a rubric and calibrate the grader against human experts to >75% agreement, run a pilot to check that a known-good reference solution passes, then operationalize with versioning, private holdouts, and canary strings. Prefer code graders, then model graders, then humans.
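The calibration and pilot steps can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names, the task dictionary keys (`id`, `reference_solution`, `expected`), and the grader signature are all hypothetical, and the 0.75 threshold is taken from the summary above.

```python
from typing import Callable, List, Sequence

# Threshold from the summary above; treat it as a starting point, not a rule.
AGREEMENT_THRESHOLD = 0.75


def agreement_rate(grader_verdicts: Sequence[bool],
                   human_verdicts: Sequence[bool]) -> float:
    """Fraction of items where the automated grader matches the human label."""
    if len(grader_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    if not grader_verdicts:
        raise ValueError("need at least one labeled item")
    matches = sum(g == h for g, h in zip(grader_verdicts, human_verdicts))
    return matches / len(grader_verdicts)


def pilot_check(grader: Callable[[str, str], bool],
                tasks: Sequence[dict]) -> List[str]:
    """Return ids of tasks whose known-good reference solution FAILS the grader.

    A non-empty result points at a grader bug or an ambiguous task,
    not at a weak model. The dict keys used here are assumptions.
    """
    return [
        t["id"]
        for t in tasks
        if not grader(t["reference_solution"], t["expected"])
    ]


def is_calibrated(grader_verdicts: Sequence[bool],
                  human_verdicts: Sequence[bool]) -> bool:
    """True only when agreement is strictly above the threshold."""
    return agreement_rate(grader_verdicts, human_verdicts) > AGREEMENT_THRESHOLD
```

Running `pilot_check` before any model is evaluated is a cheap guard: if it returns any task ids, fix the grader or the task first.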
Journey Context:
Anthropic's eval roadmap and Stanford's BetterBench both show that most custom evals fail on grader bugs, ambiguous tasks, and saturation rather than on sample size. BetterBench assessed 24 benchmarks against 46 criteria and found that implementation and maintenance were the weakest stages. Anthropic found that fixing grading bugs (e.g., exact-match on rounded numbers) can raise a model's score from 42% to 95%. A good eval starts at a 5-30% pass rate for capability measurement and graduates to near-100% for regression detection. Teams often start with hundreds of synthetic questions; the better move is 20-50 unambiguous tasks drawn from real user failures, each with a reference solution, a partial-credit rubric, and an isolated environment. Eval suites are living artifacts: without ownership and a feedback loop from production failures, they expire within months.
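To make the rounded-number grading bug concrete, here is a small sketch comparing a strict string grader with a tolerance-based one. Both function names and the 1% relative tolerance are assumptions chosen for illustration; the right tolerance depends on the task.

```python
import math


def exact_match(answer: str, expected: str) -> bool:
    """Strict string comparison; penalizes harmless rounding differences."""
    return answer.strip() == expected.strip()


def numeric_match(answer: str, expected: str, rel_tol: float = 1e-2) -> bool:
    """Compare as numbers within a relative tolerance (assumed 1% here).

    Falls back to exact string comparison when either side is not numeric.
    """
    try:
        return math.isclose(float(answer), float(expected), rel_tol=rel_tol)
    except ValueError:
        return exact_match(answer, expected)


# A correct answer rounded to two decimals fails the strict grader
# but passes the tolerant one.
strict_result = exact_match("96.12", "96.124")
tolerant_result = numeric_match("96.12", "96.124")
```

A tolerant grader can also hide real errors if the tolerance is too loose, so it is worth checking it against a few deliberately wrong answers as well as the reference solution.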
Lifecycle
2026-06-15T18:30:23.297098+00:00— report_created — created