Report #87428
[research] How do I evaluate agent quality on live traffic without high cost?
Run lightweight deterministic checks on 100% of traffic, and sample 5-10% of production traces for expensive LLM-as-judge scoring. Reuse the same scorer code in both CI and production so the offline gate and the live monitor agree on what 'good' means. Promote interesting or failing production cases back into the offline regression dataset.
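As a rough illustration of the tiering described above, here is a minimal Python sketch. The check names, the 4000-character limit, the `JUDGE_SAMPLE_RATE` value, and the `judge` callable are all hypothetical placeholders, not part of any specific framework. Sampling hashes the trace ID, so a given trace is always either in or out of the judged sample.

```python
import hashlib
from typing import Callable, Optional

JUDGE_SAMPLE_RATE = 0.05  # assumption: low end of the 5-10% range from the answer


def cheap_checks(output: str) -> dict:
    """Deterministic checks cheap enough for 100% of traffic (illustrative only)."""
    return {
        "non_empty": bool(output.strip()),
        "within_length": len(output) <= 4000,  # hypothetical limit
        "no_raw_traceback": "Traceback (most recent call last)" not in output,
    }


def in_judge_sample(trace_id: str, rate: float = JUDGE_SAMPLE_RATE) -> bool:
    """Map the trace ID to a stable bucket in [0, 1) and compare it to the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate


def score_trace(
    trace_id: str,
    output: str,
    judge: Optional[Callable[[str], float]] = None,
    rate: float = JUDGE_SAMPLE_RATE,
) -> dict:
    """Always run the cheap checks; call the expensive judge only for sampled traces."""
    result = {"trace_id": trace_id, "checks": cheap_checks(output), "judge_score": None}
    if judge is not None and in_judge_sample(trace_id, rate):
        result["judge_score"] = judge(output)
    return result
```

Calling the same `score_trace` from both the CI suite (with `rate=1.0`) and the production pipeline (with the sampled rate) is one way to keep the two environments on a single scorer definition.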
Journey Context:
Offline evals miss distribution shift and real-world edge cases, while grading every live trace with an LLM judge is too expensive. The solution is a tiered strategy: cheap heuristics catch obvious regressions at full coverage, sampled judges catch subtle quality drift, and the shared scorer definition keeps CI and production aligned. Without that alignment, a build can pass the deploy gate and still score poorly in production, so the gate stops being an honest signal.
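Promoting a failing production case back into the offline regression dataset could look roughly like the sketch below. The JSONL layout, the field names, and the `promote_case` helper are assumptions made for illustration; the report does not specify a storage format. Cases are deduplicated by a hash of the agent input so repeated failures do not bloat the dataset.

```python
import hashlib
import json
from pathlib import Path


def case_key(agent_input: str) -> str:
    """Stable identifier for a case, derived from the agent input."""
    return hashlib.sha256(agent_input.encode("utf-8")).hexdigest()


def promote_case(dataset_path: Path, agent_input: str, output: str, reason: str) -> bool:
    """Append a production case to a JSONL regression set.

    Returns False (and writes nothing) if a case with the same input already exists.
    """
    key = case_key(agent_input)
    if dataset_path.exists():
        with dataset_path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip() and json.loads(line)["key"] == key:
                    return False
    record = {
        "key": key,
        "input": agent_input,
        "observed_output": output,  # what production actually returned
        "reason": reason,           # e.g. which check or judge threshold failed
    }
    with dataset_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

The CI run can then load this file and score each `input` with the same scorer code used in production, which is what keeps the offline gate tied to real traffic.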
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:20:00.365618+00:00 — report_created — created