Report #100671
[research] Custom LLM evals give false confidence because the test set is too small, overfits a single prompt phrasing, or lacks negative examples
Build evals from real production failures, vary prompts and paraphrase ground-truth answers, include adversarial and negative cases, pin model version and seed, version the dataset, and run a human audit on a stratified sample of passes and fails before treating the metric as a release gate.
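Two of the steps above, versioning the dataset and drawing a stratified audit sample, can be sketched in a few lines. This is a minimal illustration, not part of any eval framework: the record fields (`id`, `category`, `passed`) and both function names are hypothetical, and the fingerprint assumes the examples are JSON-serializable.

```python
import hashlib
import json
import random
from collections import defaultdict


def dataset_fingerprint(examples):
    """Short content hash to log next to every eval score.

    Assumption: examples are JSON-serializable. The hash depends on example
    order, so keep the file sorted or sort before hashing.
    """
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def stratified_audit_sample(results, per_stratum=5, seed=0):
    """Pick up to per_stratum records from each (category, passed) bucket."""
    rng = random.Random(seed)  # fixed seed keeps the audit set reproducible
    buckets = defaultdict(list)
    for record in results:
        buckets[(record["category"], record["passed"])].append(record)
    sample = []
    for key in sorted(buckets):  # deterministic bucket order
        items = buckets[key]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Hypothetical graded results: three categories, passes and fails in each.
results = [
    {"id": i, "category": cat, "passed": i % 3 != 0}
    for i, cat in enumerate(["refund", "adversarial", "negative"] * 20)
]
audit = stratified_audit_sample(results, per_stratum=3, seed=42)
print(dataset_fingerprint(results), len(audit))
```

Sampling passes as well as fails matters here: auditing only failures cannot reveal a grader that is too lenient.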
Journey Context:
Many teams start evals by collecting around 50 hand-written examples and comparing aggregate accuracy across model versions. That gives false confidence for three reasons: a sample that small cannot distinguish a real regression from noise, the examples tend to share a single prompt phrasing so the score reflects that wording rather than the underlying behavior, and without negative cases nothing catches false positives. OpenAI Evals and similar frameworks work best when treated like a regression suite: source examples from real production failures, paraphrase and vary prompts, include adversarial and negative cases, pin the model version and seed, version the dataset, and run a human audit on a stratified sample of passes and fails before gating releases. An eval that never fails is probably not testing the right thing.
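To make the sample-size point concrete, here is a rough sketch that computes a Wilson score interval for an observed pass rate. The function name and the 80% pass rate are illustrative assumptions. The interval also treats examples as independent, which paraphrased variants of one case are not, so the real uncertainty may be wider.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Same observed 80% pass rate at three hypothetical eval sizes.
for n in (50, 500, 5000):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n}: interval {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

With 40 of 50 passing, the interval runs from roughly 0.67 to 0.89, so a drop of several points between model versions sits well inside the noise. The width shrinks roughly with the square root of the number of examples.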
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:54:15.909286+00:00 — report_created — created