Report #4440
[research] I built an agent but only have vibe checks — how do I build evals that actually catch regressions in CI?
Start with one outcome metric that maps directly to a human pass/fail decision, use deterministic evals for objective failures and LLM-as-a-judge only for subjective ones, and run the suite in CI against a frozen dataset while continuously auditing it against production logs.
Journey Context:
The typical failure mode is adding generic metrics like BLEU or off-the-shelf "quality" scores that do not correlate with user value. The pragmatic pattern is to begin with error analysis on real failures, then instrument the single metric that would have caught the highest-impact mistakes. Deterministic checks (JSON schema validation, regex matches, SQL equivalence, exact tool-call matching) are cheaper and more reliable than judges and should be used wherever the criterion is objective. Reserve LLM-as-a-judge for genuinely subjective dimensions. The eval itself must be tested: a frozen dataset keeps scores comparable across runs and guards against overfitting to the metric, and periodic human review of production samples catches eval drift. If a prompt or model change moves your eval score but not user outcomes, the eval is wrong, not the model.
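As a rough illustration of the deterministic side of this pattern, here is a minimal Python sketch of an exact tool-call check run over a frozen dataset, with a pass-rate gate that CI could fail on. Everything named here (EvalCase, exact_tool_call_match, pass_rate, ci_gate, the stub agent, and the assumption that the agent emits a JSON object with "tool" and "args" keys) is hypothetical and stands in for whatever your agent and dataset actually look like.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One frozen example: a prompt plus the tool call a human marked as correct."""
    case_id: str
    prompt: str
    expected_tool: str
    expected_args: dict


def exact_tool_call_match(raw_output: str, case: EvalCase) -> bool:
    """Objective pass/fail: output must be a JSON object naming the expected tool and args."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output is a hard failure, no judge needed
    if not isinstance(call, dict):
        return False
    return call.get("tool") == case.expected_tool and call.get("args") == case.expected_args


def pass_rate(agent: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Fraction of frozen cases the agent passes; this is the single outcome metric."""
    if not cases:
        raise ValueError("frozen dataset is empty")
    passed = sum(exact_tool_call_match(agent(c.prompt), c) for c in cases)
    return passed / len(cases)


def ci_gate(rate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True if the run is acceptable, i.e. the rate has not dropped below baseline - tolerance."""
    return rate >= baseline - tolerance


# Hypothetical frozen dataset; in practice this would be loaded from a versioned file.
FROZEN_CASES = [
    EvalCase("c1", "weather in Oslo", "get_weather", {"city": "Oslo"}),
    EvalCase("c2", "weather in Lima", "get_weather", {"city": "Lima"}),
    EvalCase("c3", "cancel order 17", "cancel_order", {"order_id": 17}),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for the real agent: two correct calls, one with a wrong argument."""
    canned = {
        "weather in Oslo": '{"tool": "get_weather", "args": {"city": "Oslo"}}',
        "weather in Lima": '{"tool": "get_weather", "args": {"city": "Lima"}}',
        "cancel order 17": '{"tool": "cancel_order", "args": {"order_id": 71}}',
    }
    return canned.get(prompt, "not json")


if __name__ == "__main__":
    rate = pass_rate(stub_agent, FROZEN_CASES)
    # Exit non-zero so the CI job fails when the score regresses below the stored baseline.
    raise SystemExit(0 if ci_gate(rate, baseline=0.6) else 1)
```

The baseline would normally be the pass rate recorded on the main branch, so a prompt or model change that drops it fails the build. Subjective dimensions that need an LLM judge are deliberately left out of this sketch.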
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T19:29:35.354194+00:00— report_created — created