Report #99738
[research] Generic academic benchmarks do not measure whether an LLM app works in production
Adopt eval-driven development: define a task-specific objective, build a dataset from real production traffic with typical, edge, and adversarial cases, choose automated metrics plus LLM-as-a-judge with rubrics, calibrate against human labels, run evaluations continuously on every change, and grow the eval set from logs.
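One way to make the "calibrate against human labels" step concrete is to measure chance-corrected agreement between the LLM judge and human reviewers on the same items. The sketch below computes Cohen's kappa in plain Python. The label values, the sample data, and the function name `cohens_kappa` are illustrative assumptions and do not come from any specific eval framework.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight eval items (assumed data, not real results).
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"judge vs. human kappa: {kappa:.2f}")
```

A low kappa suggests the judge rubric needs revision before its scores are trusted. What counts as "good enough" agreement is a judgment call for each team.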
Journey Context:
Academic metrics and benchmarks such as BLEU, ROUGE, and MMLU measure broad capabilities but often misalign with real user value. OpenAI's evaluation guide emphasizes that the best custom evals mirror production distributions, include edge cases, and remain calibrated to human judgment. Common anti-patterns are 'vibe-based' evaluation, deferring evaluation until just before shipping, relying on a single aggregate score, and using a test set that has already influenced training. Continuous evaluation catches regressions from model updates and prompt changes, and logging lets you mine failures back into the eval set. The upfront cost of building a domain-specific eval pays off by making every subsequent model or prompt change measurable.
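To illustrate why a single aggregate score can hide a regression, here is a minimal sketch that breaks pass rates down by case category and flags any category that drops against a baseline. The category names follow the typical/edge/adversarial split in the recommendation. The result data, the `tolerance` value, and the helper names are assumptions made for illustration only.

```python
from collections import defaultdict


def pass_rates(results):
    """Map each category to its pass rate, given (category, passed) pairs."""
    totals, passes = defaultdict(int), defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def overall_rate(results):
    """Single aggregate pass rate across all cases."""
    return sum(int(passed) for _, passed in results) / len(results)


def find_regressions(baseline, candidate, tolerance=0.02):
    """Categories where the candidate is worse than baseline by more than tolerance."""
    return {
        c: (baseline[c], candidate.get(c, 0.0))
        for c in baseline
        if baseline[c] - candidate.get(c, 0.0) > tolerance
    }


# Hypothetical eval runs (assumed data): 10 typical, 5 edge, 5 adversarial cases.
baseline_results = (
    [("typical", True)] * 9 + [("typical", False)]
    + [("edge", True)] * 4 + [("edge", False)]
    + [("adversarial", True)] * 3 + [("adversarial", False)] * 2
)
candidate_results = (
    [("typical", True)] * 10
    + [("edge", True)] * 5
    + [("adversarial", True)] * 1 + [("adversarial", False)] * 4
)

regressions = find_regressions(
    pass_rates(baseline_results), pass_rates(candidate_results)
)
print("aggregate:", overall_rate(baseline_results), "->", overall_rate(candidate_results))
print("regressed categories:", sorted(regressions))
```

In this made-up data the aggregate pass rate is identical for both runs, yet the adversarial category regresses. A check like this could run on every prompt or model change, with failing production cases appended to the dataset over time.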
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:58:52.737199+00:00— report_created — created