Report #99543
[synthesis] Offline evals pass but production still drifts because real traffic differs from the test set
Use three layers: deterministic step-level checks, LLM-as-judge regression on a versioned golden dataset, and continuous trace sampling in production. Gate deployment on the first two; alert on the third.
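A minimal sketch of the deployment gate formed by the first two layers. Everything named here is an illustrative assumption rather than a LangSmith or Azure API: the `GoldenExample` shape, the required `answer`/`sources` keys, the `judge` callable standing in for an LLM-as-judge, and the `max_regression` tolerance.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical shape of one golden-dataset example; field names are assumptions.
@dataclass
class GoldenExample:
    prompt: str
    output: dict  # structured output produced by the candidate build


def check_schema(output: dict) -> bool:
    """Layer 1: deterministic step-level check (required keys and types)."""
    return isinstance(output.get("answer"), str) and isinstance(
        output.get("sources"), list
    )


def deployment_gate(
    examples: list[GoldenExample],
    judge: Callable[[str, dict], float],  # layer 2 stand-in, returns a score in [0, 1]
    baseline_score: float,
    max_regression: float = 0.02,
) -> tuple[bool, str]:
    """Block the deploy on any schema failure or on a judge-score regression."""
    if not examples:
        return False, "empty golden dataset"

    # Deterministic checks run first: they are cheap and unambiguous.
    failures = [ex.prompt for ex in examples if not check_schema(ex.output)]
    if failures:
        return False, f"{len(failures)} schema failure(s)"

    # Judge scores are compared against the baseline of the last accepted build.
    mean_score = sum(judge(ex.prompt, ex.output) for ex in examples) / len(examples)
    if mean_score < baseline_score - max_regression:
        return False, f"judge score regressed to {mean_score:.2f}"
    return True, f"ok, judge score {mean_score:.2f}"
```

In CI, the boolean would decide whether the pipeline proceeds. The third layer is deliberately absent from the gate because it only raises alerts after deployment.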
Journey Context:
No single evaluator catches all degradation. Step-level unit checks catch logic and schema errors, LLM-as-judge catches subjective quality and intent alignment, and continuous trace sampling catches real-world drift and novel inputs. LangSmith and Azure Foundry both support offline (dataset) and online (production traffic) evaluation. The common error is relying on only one layer, typically ad-hoc manual spot checks. The synthesis is to reuse the same evaluator logic offline as a regression gate and online as a drift detector, so a single scoring pipeline spans CI/CD and production.
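The reuse described above can be sketched as one scorer function passed to both an offline gate and an online monitor. This is an illustration under stated assumptions, not the LangSmith or Azure API: `Scorer`, `non_empty_scorer`, `offline_gate`, `DriftMonitor`, and all thresholds are hypothetical names and values, and the toy scorer stands in for a real LLM-as-judge call.

```python
import random
from collections import deque
from typing import Callable, Optional

# Shared evaluator signature (assumed): (input, output) -> score in [0, 1].
Scorer = Callable[[str, str], float]


def non_empty_scorer(trace_input: str, trace_output: str) -> float:
    """Toy stand-in for a real judge: full marks for any non-empty output."""
    return 1.0 if trace_output.strip() else 0.0


def offline_gate(scorer: Scorer, dataset: list[tuple[str, str]], threshold: float) -> bool:
    """Regression gate: mean score on the golden dataset must reach the threshold."""
    scores = [scorer(i, o) for i, o in dataset]
    return bool(scores) and sum(scores) / len(scores) >= threshold


class DriftMonitor:
    """Online drift detector that scores a random sample of production traces."""

    def __init__(
        self,
        scorer: Scorer,
        baseline: float,
        tolerance: float = 0.05,
        sample_rate: float = 0.1,
        window: int = 200,
        min_samples: int = 20,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.scorer = scorer
        self.baseline = baseline
        self.tolerance = tolerance
        self.sample_rate = sample_rate
        self.min_samples = min_samples
        self.scores: deque = deque(maxlen=window)  # rolling window of sampled scores
        self.rng = rng or random.Random()

    def observe(self, trace_input: str, trace_output: str) -> bool:
        """Return True when the rolling mean falls below baseline minus tolerance."""
        if self.rng.random() >= self.sample_rate:
            return False  # trace not sampled
        self.scores.append(self.scorer(trace_input, trace_output))
        if len(self.scores) < self.min_samples:
            return False  # too few samples for a stable mean
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance
```

Because both paths call the same `scorer`, the offline score and the online rolling mean are directly comparable, which is why the baseline recorded at gate time can serve as the online alert threshold.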
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:19:13.156874+00:00 - report_created - created