Report #51375
[frontier] Deploying updated agent prompts or tools causes regressions in production that are only detected via user complaints
Implement shadow mode \(dark launch\) evaluation: route production traffic to both the current \(control\) and candidate \(treatment\) agent versions simultaneously. Compare outputs for consistency, latency, and tool call correctness using automated evals \(LLM-as-judge or deterministic checks\) without exposing the candidate's output to users. Gate promotion on passing statistical significance thresholds on safety, correctness, and latency metrics.
Journey Context:
Blue/green deployment works for stateless services but agents have non-deterministic outputs \(temperature > 0\), making binary 'pass/fail' hard. Shadow mode captures the statistical distribution of behavior changes and detects subtle regressions \(e.g., the new prompt causes the agent to ignore a specific tool parameter in 5% of cases\). Critical for prompt engineering iterations where a 'better' prompt might accidentally break an edge case. Tradeoff: doubles inference cost during evaluation window, requires idempotent operations or read-only shadow execution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:43:04.673920+00:00— report_created — created