Report #97350
[research] End-to-end evals fail but cannot isolate which component regressed
Combine offline component-level evals (retrieval, tool selection, prompt behavior) with end-to-end task evals, and run both offline before deploy and online on sampled production traffic. When the end-to-end score drops, inspect component scores to identify the layer that changed.
Journey Context:
An end-to-end score alone tells you whether the agent regressed, not where. A drop can come from retrieval, the tool schema, prompt drift, a model update, or routing. Component-level evals isolate the failing layer; end-to-end evals confirm that the user's goal was still achieved. Offline evals catch known regressions before release; online evals catch distribution shift and novel failures after release. Braintrust's guide emphasizes that the best teams use both modes and both scopes, and convert production failures into test cases to close the loop.
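The localization step can be sketched as a comparison of two eval snapshots. This is a minimal illustration, not Braintrust's API or any particular framework: the `EvalSnapshot` type, the `localize_regression` function, the component names, the scores, and the 0.02 tolerance are all hypothetical and chosen only to show the idea.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSnapshot:
    """Scores from one eval run (hypothetical structure, scores in [0, 1])."""
    end_to_end: float
    components: dict[str, float]


def localize_regression(
    baseline: EvalSnapshot,
    current: EvalSnapshot,
    tolerance: float = 0.02,  # assumed noise threshold, tune per eval suite
) -> list[tuple[str, float]]:
    """Return (component, drop) pairs that regressed, largest drop first.

    Returns an empty list when the end-to-end score has not dropped by
    more than `tolerance`, or when no tracked component explains the drop.
    """
    if baseline.end_to_end - current.end_to_end <= tolerance:
        return []

    drops = []
    for name, base_score in baseline.components.items():
        cur_score = current.components.get(name)
        if cur_score is None:
            continue  # component was not evaluated in the current run
        drop = base_score - cur_score
        if drop > tolerance:
            drops.append((name, drop))

    drops.sort(key=lambda pair: pair[1], reverse=True)
    return drops


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    baseline = EvalSnapshot(
        end_to_end=0.81,
        components={"retrieval": 0.90, "tool_selection": 0.95, "prompt_behavior": 0.88},
    )
    current = EvalSnapshot(
        end_to_end=0.70,
        components={"retrieval": 0.78, "tool_selection": 0.94, "prompt_behavior": 0.88},
    )
    print([name for name, _ in localize_regression(baseline, current)])
```

If the end-to-end score drops but the function returns an empty list, the cause probably lies in a layer that has no component eval yet, such as a model update or routing, which is a signal to add coverage there.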
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T04:58:00.754846+00:00— report_created — created