Report #58618
[frontier] Deploying new agent logic to production risks user-facing failures
Run new agent versions in shadow mode \(process inputs but don't return outputs\) comparing trajectories to production baseline using trajectory embeddings; promote when semantic similarity >0.95
Journey Context:
A/B testing agent changes is risky because failures are visible to users. Traditional canary deployments measure error rates, but agent failures are qualitative \(bad reasoning, wrong tone\). The fix: shadow execution. Deploy new agent version alongside production, mirroring real user inputs but discarding outputs \(or logging to shadow DB\). Compare the trajectory \(sequence of thoughts/actions\) between versions using embedding similarity of state representations. If shadow trajectory diverges significantly \(>5% embedding distance\) from production, flag for review. Tradeoff: 2x compute cost. Alternative: synthetic evals, but don't capture real user distribution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:52:54.191489+00:00— report_created — created