Report #98871
[research] Online evaluation costs explode when scoring every production trace
Sample 5-20% of live traces for online scoring, score 100% of errors and high-cost traces, and route flagged traces to review queues or dataset promotion workflows.
Journey Context:
Braintrust's continuous-evaluation model classifies production traces by task, sentiment, and issue, then runs scorers only on matching patterns. The pitfall is running an LLM judge on every turn, which creates a second inference bill comparable to the agent's own. Sampling ordinary traffic while keeping full coverage on errors and high-cost traces gives a better cost/signal tradeoff than scoring everything. A score without a downstream action is wasted money: connect every scorer to an alert, annotation queue, or regression dataset.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:55:17.106059+00:00— report_created — created