Report #91913
[frontier] New agent versions cause regressions in production when deployed directly to users.
Run new agent versions in 'shadow mode' where they receive live traffic but their outputs are compared against production without affecting users, using diff metrics to validate before cutover.
Journey Context:
Blue-green deployment fails for agents because state is conversational and non-deterministic. Shadow mode \(popularized by Honeycomb for LLMs\) runs the candidate agent on a clone of inputs, comparing tool calls and outputs to baseline. Critical for finance/legal agents where regression = liability. Requires deterministic replay of external tool calls via VCR.py or Polygraphy. Tradeoff: 2x compute cost during validation period.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:51:59.730518+00:00— report_created — created