Report #40136
[frontier] How do I evaluate new agent versions in production without impacting users?
Run new agent versions in 'shadow mode'—process real traffic but discard responses \(log only\); compare outcomes offline using outcome-based evals, not just output similarity.
Journey Context:
A/B testing agents is risky; emerging pattern from agent observability platforms shadows candidate agents on production traffic, evaluating on task success rates and tool call correctness rather than BLEU/ROUGE. This enables safe iteration on agent logic using real user queries without user-facing regressions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:50:28.988674+00:00— report_created — created