Report #100705
[research] Production failures are debugged once and forgotten instead of becoming regression coverage
Convert every production trace that fails an online scorer into an offline eval case; run the same eval definitions in CI and production so the regression suite grows from real user behavior.
Journey Context:
Pre-deployment evals cover only the scenarios the team anticipated, while production traffic contains the edge cases nobody planned for. The trace-to-eval loop closes that gap in three steps: online evals catch regressions live, each failed trace becomes a dataset row, and offline evals block future merges that reintroduce the failure. Platforms like Braintrust expose this as a one-click workflow, but the discipline can be implemented with any trace store and eval harness that share a common schema.
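The loop described above can be sketched without any particular platform. The following is a minimal illustration, not Braintrust's API: `Trace`, `EvalCase`, `no_empty_answer`, `capture_failures`, `run_offline_eval`, the JSONL dataset file, and the 0.5 threshold are all hypothetical names and choices made for this example. The one idea it tries to show is that the online and offline paths read the same `SCORERS` registry, so a failure seen in production is checked by the identical definition in CI.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Trace:
    """One production request/response pair (hypothetical schema)."""
    trace_id: str
    input_text: str
    output_text: str


@dataclass(frozen=True)
class EvalCase:
    """One regression dataset row derived from a failed trace."""
    case_id: str
    input_text: str
    bad_output: str
    failed_scorer: str


def no_empty_answer(input_text: str, output_text: str) -> float:
    """Example scorer: 1.0 if the model said anything at all, else 0.0."""
    return 1.0 if output_text.strip() else 0.0


# Shared by the online and offline paths, so both apply the same definitions.
SCORERS: Dict[str, Callable[[str, str], float]] = {
    "no_empty_answer": no_empty_answer,
}
THRESHOLD = 0.5  # assumed pass mark; scores below it count as failures


def load_cases(dataset_path: Path) -> List[EvalCase]:
    """Read the JSONL regression dataset; a missing file means no cases yet."""
    if not dataset_path.exists():
        return []
    with dataset_path.open(encoding="utf-8") as f:
        return [EvalCase(**json.loads(line)) for line in f if line.strip()]


def capture_failures(traces: Iterable[Trace], dataset_path: Path) -> List[EvalCase]:
    """Online side: append one row per (trace, failing scorer), skipping duplicates."""
    seen = {case.case_id for case in load_cases(dataset_path)}
    added: List[EvalCase] = []
    with dataset_path.open("a", encoding="utf-8") as f:
        for trace in traces:
            for name, scorer in SCORERS.items():
                if scorer(trace.input_text, trace.output_text) >= THRESHOLD:
                    continue
                case_id = f"{trace.trace_id}:{name}"
                if case_id in seen:
                    continue
                case = EvalCase(case_id, trace.input_text, trace.output_text, name)
                f.write(json.dumps(asdict(case)) + "\n")
                seen.add(case_id)
                added.append(case)
    return added


def run_offline_eval(model: Callable[[str], str], dataset_path: Path) -> List[str]:
    """CI side: replay stored inputs through a candidate model; return failing case ids."""
    failures: List[str] = []
    for case in load_cases(dataset_path):
        output = model(case.input_text)
        if SCORERS[case.failed_scorer](case.input_text, output) < THRESHOLD:
            failures.append(case.case_id)
    return failures
```

In a CI job, a non-empty list from `run_offline_eval` would be the signal to fail the build. Keying each row on trace id plus scorer name is one way to keep repeated captures of the same trace from inflating the dataset; a real trace store may already provide a stable identifier for this.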
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T04:57:29.182703+00:00— report_created — created