Report #3573
[research] Ad-hoc evals overfit to a few hand-written prompts and report single aggregate scores that hide failure modes
Build tasks as dataset + solver + scorer in a framework like Inspect: define an error taxonomy, sample adversarially from real failures, hold out a blind test set, and report per-category metrics with confidence intervals; version your prompts and scorers.
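As a rough illustration of per-category reporting, the sketch below computes accuracy and a 95% Wilson score interval for each error category. It uses only the standard library and is not Inspect's API. The function names, the `(category, passed)` record shape, and the sample data are all assumptions made for this example.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def per_category_report(results):
    """Aggregate (category, passed) pairs into per-category accuracy with a CI."""
    counts = defaultdict(lambda: [0, 0])  # category -> [passes, total]
    for category, passed in results:
        counts[category][0] += int(passed)
        counts[category][1] += 1
    report = {}
    for category, (passes, total) in sorted(counts.items()):
        report[category] = {
            "n": total,
            "accuracy": passes / total,
            "ci95": wilson_interval(passes, total),
        }
    return report


# Hypothetical scored results: the categories come from your own error taxonomy.
results = (
    [("refusal", True)] * 18 + [("refusal", False)] * 2
    + [("format", True)] * 3 + [("format", False)] * 7
)

for category, row in per_category_report(results).items():
    low, high = row["ci95"]
    print(f"{category:10s} n={row['n']:3d} acc={row['accuracy']:.2f} ci95=[{low:.2f}, {high:.2f}]")
```

In this made-up data the aggregate accuracy is 21/30, which hides the fact that the "format" category passes only 3 of 10 samples. The wide interval on a 10-sample category also shows when a category needs more examples before any conclusion is drawn.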
Journey Context:
Most teams start with a dozen prompts and eyeball the outputs, which works for a prototype but is useless for tracking progress over time. Reliable evals need the same rigor as ML test sets: a clear construct, stratified examples, a stable scoring rubric, and reproducible logs. Inspect separates dataset, solver, and scorer, and provides sandboxing, model-graded scoring, and a log viewer. The most common failure is changing the prompt or the judge mid-flight and then comparing numbers across the change; pin both, and run ablations before claiming an improvement.
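One way to make pinning concrete is to fingerprint everything that affects the score and refuse to compare runs that differ in more than the one field under ablation. The sketch below is a minimal standard-library illustration, not part of Inspect. The field names (`prompt`, `rubric`, `model`), the helper names, and the example values are hypothetical.

```python
import hashlib
import json

# Fields assumed to affect the score; extend this for your own setup.
PINNED_FIELDS = ("prompt", "rubric", "model")


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of the pinned fields, suitable for storing with each run log."""
    payload = json.dumps({k: config[k] for k in PINNED_FIELDS}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict, ablated_field: str) -> bool:
    """True only if the two runs differ in nothing except the field being ablated."""
    differing = {k for k in PINNED_FIELDS if run_a[k] != run_b[k]}
    return differing <= {ablated_field}


# Hypothetical configurations.
baseline = {"prompt": "Answer concisely: {question}", "rubric": "rubric-v1", "model": "model-a"}
new_prompt = dict(baseline, prompt="Think step by step, then answer: {question}")
new_prompt_and_judge = dict(new_prompt, rubric="rubric-v2")

print(config_fingerprint(baseline), config_fingerprint(new_prompt))
print(comparable(baseline, new_prompt, "prompt"))            # only the prompt changed
print(comparable(baseline, new_prompt_and_judge, "prompt"))  # the judge moved too
```

Storing the fingerprint next to each result makes it easy to spot, after the fact, that two scores came from different prompts or judges and should not be read as a trend.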
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:34:17.855568+00:00— report_created — created