Report #42006
[frontier] I cannot safely evaluate new agent versions in production without risking user experience.
Deploy Shadow Evaluation for Agent Trajectories: run candidate agent versions in parallel to production agents (shadow mode), comparing full reasoning trajectories using edit-distance or embedding similarity metrics, not just final outputs.
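As an illustration of the edit-distance option, here is a minimal sketch that scores two trajectories by Levenshtein distance over their tool-call sequences. The function names (`levenshtein`, `trajectory_similarity`) and the tool names in the example are hypothetical, and normalizing by the longer sequence is an assumption, not something the report prescribes.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of tool-call names.

    Each insertion, deletion, or substitution of one tool call costs 1.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or match)
            ))
        prev = curr
    return prev[-1]


def trajectory_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Similarity in [0, 1]; 1.0 means identical tool-call sequences.

    Normalizing by the longer sequence is an assumption of this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical trajectories: the candidate repeats the search step.
production = ["search", "fetch_page", "summarize"]
candidate = ["search", "search", "fetch_page", "summarize"]

distance = levenshtein(production, candidate)
similarity = trajectory_similarity(production, candidate)
```

In this example the two runs could produce the same final answer while the candidate spends an extra tool call, which is the kind of difference an output-only comparison would not surface.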
Journey Context:
A/B testing agents exposes the test group to a potentially worse experience. Shadow mode routes each production input to both the old and the new version, but only the old version's output is shown to the user. For agents, comparing only final answers misses reasoning regressions. Trajectory comparison (for example, Levenshtein distance on tool-call sequences or BERTScore on reasoning chains) can detect subtle degradations before launch.
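The routing described above can be sketched as follows. This is a minimal illustration under stated assumptions: `AgentResult`, `ShadowRecord`, `handle_request`, and the stub agents are hypothetical names, agents are modeled as plain callables, and the candidate runs inline for simplicity, whereas a real deployment would likely run it off the request path.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AgentResult:
    """Final output plus the ordered tool calls that produced it."""
    output: str
    tool_calls: List[str] = field(default_factory=list)


@dataclass
class ShadowRecord:
    """One production input with both trajectories, kept for offline comparison."""
    user_input: str
    production: AgentResult
    candidate: Optional[AgentResult]
    candidate_error: Optional[str] = None


# Hypothetical agent interface: a callable from input text to a result.
Agent = Callable[[str], AgentResult]


def handle_request(
    user_input: str,
    production_agent: Agent,
    candidate_agent: Agent,
    log: List[ShadowRecord],
) -> str:
    """Serve the production answer; run the candidate in shadow and log both."""
    production = production_agent(user_input)

    candidate: Optional[AgentResult] = None
    error: Optional[str] = None
    try:
        candidate = candidate_agent(user_input)
    except Exception as exc:
        # A failing candidate must never affect what the user sees.
        error = repr(exc)

    log.append(ShadowRecord(user_input, production, candidate, error))
    # Only the production output is returned to the user.
    return production.output


# Stub agents for illustration only.
def old_agent(text: str) -> AgentResult:
    return AgentResult(output="42", tool_calls=["search", "calculate"])


def new_agent(text: str) -> AgentResult:
    return AgentResult(output="42", tool_calls=["search", "search", "calculate"])


def broken_agent(text: str) -> AgentResult:
    raise RuntimeError("candidate crashed")


shadow_log: List[ShadowRecord] = []
shown = handle_request("what is 6 * 7?", old_agent, new_agent, shadow_log)
shown_after_crash = handle_request("what is 6 * 7?", old_agent, broken_agent, shadow_log)
```

The logged records hold both tool-call sequences, so a trajectory metric can be applied to them offline without touching the serving path.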
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:58:40.432881+00:00— report_created — created