Report #1036
[research] Custom evaluations are often irreproducible because prompts, metrics, and splits are ad hoc and not version-controlled.
Implement custom tasks in a standard framework such as EleutherAI lm-evaluation-harness (YAML configs + registered metrics) or OpenAI Evals, pin the task config and few-shot seed, report bootstrap confidence intervals, and hold out a truly unseen test set that is never used for model selection.
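One way to make "pin the task config and few-shot seed" and "hold out a test set" concrete is sketched below. It is a framework-independent illustration, not the lm-evaluation-harness or OpenAI Evals API: the helper names `config_fingerprint` and `is_held_out`, and every key in `task_config`, are hypothetical. The idea is to record a hash of the exact config next to each reported score, and to assign examples to the held-out set by hashing their IDs so the split does not drift between runs.

```python
import hashlib
import json


def config_fingerprint(task_config: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the config.

    Keys are sorted so the digest does not depend on dict insertion order.
    """
    canonical = json.dumps(task_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_held_out(example_id: str, test_fraction: float = 0.2) -> bool:
    """Deterministically assign an example to the held-out test set.

    The decision depends only on the example ID, so it stays the same
    across runs and machines.
    """
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < test_fraction


# Hypothetical task config; these keys are illustrative, not a real schema.
task_config = {
    "task": "my_custom_task",
    "prompt_template": "Question: {question}\nAnswer:",
    "num_fewshot": 5,
    "fewshot_seed": 1234,
    "metric": "exact_match",
}

fingerprint = config_fingerprint(task_config)
print("config fingerprint:", fingerprint[:12])
```

Any change to the prompt, the number of few-shot examples, or the seed changes the fingerprint, so two scores with different fingerprints should not be compared directly.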
Journey Context:
Standard harnesses enforce a separation between task definition, model backend, and metric computation, which is why they back leaderboards like the Open LLM Leaderboard. lm-evaluation-harness uses YAML configs for prompts, output types, filters, and decontamination flags; OpenAI Evals uses JSONL + eval classes. The common failure mode is iteratively optimizing on the test set or changing prompts without versioning, either of which invalidates comparisons between runs. Treat the eval like a regression test: version-control the config, fix success criteria before running, and report bootstrap confidence intervals so that score differences can be judged against sampling noise.
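As a minimal sketch of the bootstrap step, the following computes a percentile bootstrap confidence interval for mean accuracy using only the Python standard library. The function name `bootstrap_ci`, its defaults, and the per-example scores are illustrative assumptions, not part of either harness; a real run would feed in the per-example metric values the harness logs.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Resamples the per-example scores with replacement, recomputes the mean
    each time, and returns (mean, lower, upper). The seed is fixed so the
    interval itself is reproducible.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Illustrative per-example correctness (1 = correct, 0 = incorrect).
per_example_correct = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

mean, lower, upper = bootstrap_ci(per_example_correct)
print(f"accuracy = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With only ten examples the interval is wide, which is the point: if two models' intervals overlap heavily, the difference in their scores may be noise rather than a real improvement.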
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:43.816473+00:00— report_created — created