Report #99312
[research] Only running offline evals before release
Run offline evals on curated datasets in CI for benchmarking and regression gates, and sample live production traces for reference-free online evals. Convert every confirmed production failure into a new offline eval case with its trace attached.
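As a minimal sketch of the online half, the snippet below samples a stable fraction of production traces and scores each output with checks that need no gold answer. The names `SAMPLE_RATE`, `is_sampled`, and `reference_free_checks` are hypothetical, and the two checks shown (non-empty output, parseable JSON) are illustrative assumptions, not a recommended set.

```python
import hashlib
import json

SAMPLE_RATE = 0.05  # assumption: score roughly 5% of production traces


def is_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select a stable subset of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def reference_free_checks(output: str) -> dict:
    """Score an output without a reference answer (illustrative checks only)."""
    try:
        json.loads(output)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    return {
        "non_empty": bool(output.strip()),
        "valid_json": valid_json,
    }
```

Hashing the trace id rather than drawing a random number keeps the sampling decision reproducible, so the same trace is either always scored or never scored across reruns.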
Journey Context:
Offline evals catch regressions before deploy, but they are frozen snapshots; production traffic surfaces failure modes the dataset never anticipated. Online evals on sampled live traffic fill that gap by detecting drift after release. The virtuous loop is: ship, observe a failure, add it to the offline suite, fix it, then run the suite in CI so the failure cannot silently return. Braintrust's trace-to-dataset workflow is built around this exact cycle.
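The loop described above can be sketched in a few lines, assuming traces are plain dicts with an `input` field and that exact string match is an acceptable scorer. Every name here (`trace_to_case`, `to_jsonl_line`, `pass_rate`, `gate`) is hypothetical and does not reflect Braintrust's API or any other tool's.

```python
import json
from typing import Callable


def trace_to_case(trace: dict, expected: str) -> dict:
    """Turn a confirmed failing trace into an offline eval case, trace attached."""
    return {"input": trace["input"], "expected": expected, "source_trace": trace}


def to_jsonl_line(case: dict) -> str:
    """Serialize one case as a line for a JSONL dataset file."""
    return json.dumps(case, sort_keys=True)


def pass_rate(cases: list, model: Callable[[str], str]) -> float:
    """Fraction of cases where the model output equals the expected answer."""
    if not cases:
        return 1.0  # assumption: an empty suite does not block a release
    passed = sum(1 for case in cases if model(case["input"]) == case["expected"])
    return passed / len(cases)


def gate(cases: list, model: Callable[[str], str], min_pass_rate: float = 1.0) -> bool:
    """Regression gate: True when the suite meets the required pass rate."""
    return pass_rate(cases, model) >= min_pass_rate
```

In CI, a script could load the JSONL dataset, call `gate`, and exit non-zero when it returns `False`. Keeping the original trace on each case preserves the context needed to reproduce the failure later.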
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:55:21.614931+00:00 — report_created — created