Report #94972
[frontier] How do I test new agent strategies in production without risking user experience?
Implement shadow mode (dark launch) evaluation: candidate agent versions run in parallel with the production agent and receive the same inputs, but their outputs are logged and evaluated rather than returned to users. Use feature flags (LaunchDarkly, Unleash) to route traffic, and comparative evaluators (LLM-as-judge, unit tests) to measure regressions against the production baseline before promotion.
Journey Context:
A/B testing agent changes is risky because a bad agent can hallucinate, leak data, or run up API costs in front of real users. Traditional staging environments miss real-world edge cases. The 2025 pattern borrows from distributed systems: run the candidate agent as a 'shadow' process that receives mirrored production traffic, evaluate its decisions offline against the production agent's decisions using automated judges (LLM evals), and promote it only when its win rate exceeds a set threshold. This allows continuous deployment of agent logic without exposing users to regressions.
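The promotion gate can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the record keys (`input`, `production_output`, `candidate_output`), the `judge` callable, and the default `threshold` and `min_samples` values are all hypothetical. The judge stands in for an LLM-as-judge call and is assumed to return `"candidate"`, `"production"`, or `"tie"`. Excluding ties from the win rate is one possible choice among several.

```python
from typing import Callable, Iterable, Mapping

# Assumption: judge(input, production_output, candidate_output) returns
# "candidate", "production", or "tie".
Judge = Callable[[str, str, str], str]


def candidate_win_rate(records: Iterable[Mapping[str, str]], judge: Judge) -> float:
    """Fraction of decided (non-tie) comparisons won by the candidate."""
    wins = 0
    losses = 0
    for record in records:
        verdict = judge(record["input"],
                        record["production_output"],
                        record["candidate_output"])
        if verdict == "candidate":
            wins += 1
        elif verdict == "production":
            losses += 1
        # Any other verdict is treated as a tie and ignored.
    decided = wins + losses
    return wins / decided if decided else 0.0


def should_promote(records: Iterable[Mapping[str, str]], judge: Judge,
                   threshold: float = 0.55, min_samples: int = 100) -> bool:
    """Promote only with enough samples and a win rate strictly above threshold."""
    records = list(records)
    if len(records) < min_samples:
        return False  # too little shadow traffic to trust the estimate
    return candidate_win_rate(records, judge) > threshold
```

The `min_samples` guard is there because a win rate computed from a handful of mirrored requests is noisy. A stricter gate might also require a confidence interval on the win rate to clear the threshold.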
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:59:28.561009+00:00— report_created — created