Report #24825
[frontier] New agent versions fail silently in production or regress on edge cases not caught in evals
Run new agent versions in 'shadow' mode (duplicate the inputs, log the outputs without serving them) alongside production, and compare decisions offline using an LLM-as-judge before shifting traffic
Journey Context:
Traditional software A/B testing risks bad agent decisions reaching users. Shadow mode (borrowed from ML infrastructure) forks production traffic: the existing agent handles the request (the user sees this response), while the candidate agent processes the same input in parallel and logs its output to an evaluation store. This allows measuring 'decision drift' (how often the new agent disagrees with the old one) and 'outcome quality' (the new agent's output as scored by an LLM judge or a human labeler) without user impact. Critical implementation detail: ensure shadow execution does not trigger side effects (the shadow agent must never actually send emails or charge cards). Use idempotency keys or mock adapters for external tools in shadow mode. This catches 'death by a thousand cuts' regressions (slightly worse tone, occasional hallucinations) that aggregate metrics miss.
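The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: every name here (`ShadowRunner`, `LiveTools`, `ShadowTools`, `agent_v1`, `agent_v2`) is hypothetical, the "evaluation store" is an in-memory list, the candidate runs sequentially rather than in parallel, and agreement is an exact string match standing in for an LLM judge or human labeler.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


class LiveTools:
    """Hypothetical production adapter: performs the real side effect (stubbed as a list)."""

    def __init__(self) -> None:
        self.sent: List[str] = []

    def send_email(self, to: str, body: str) -> str:
        self.sent.append(to)
        return "sent"


class ShadowTools:
    """Hypothetical mock adapter: records the intended call, performs no side effect."""

    def __init__(self) -> None:
        self.intended: List[str] = []

    def send_email(self, to: str, body: str) -> str:
        self.intended.append(to)
        return "sent"  # same return value so the agent's control flow is unchanged


# An agent takes (request, tools) and returns its decision as a string.
Agent = Callable[[str, Any], str]


@dataclass
class ShadowRecord:
    request: str
    production: str
    candidate: str
    agree: bool


@dataclass
class ShadowRunner:
    production: Agent
    candidate: Agent
    live_tools: LiveTools
    store: List[ShadowRecord] = field(default_factory=list)  # stand-in evaluation store

    def handle(self, request: str) -> str:
        # The production agent serves the request with real tools.
        served = self.production(request, self.live_tools)
        # The candidate sees the same input but only ever gets mock tools.
        try:
            shadow_out = self.candidate(request, ShadowTools())
        except Exception as exc:  # a shadow failure must never affect the user
            shadow_out = f"<shadow error: {exc!r}>"
        # Assumption: exact match as the agreement test; a judge could replace it.
        self.store.append(ShadowRecord(request, served, shadow_out, served == shadow_out))
        return served

    def decision_drift(self) -> float:
        """Fraction of logged requests where the candidate disagreed with production."""
        if not self.store:
            return 0.0
        return sum(not r.agree for r in self.store) / len(self.store)


def agent_v1(request: str, tools: Any) -> str:
    if "refund" in request:
        tools.send_email("billing@example.com", request)
        return "escalate"
    return "answer"


def agent_v2(request: str, tools: Any) -> str:
    if "refund" in request or "charge" in request:
        tools.send_email("billing@example.com", request)
        return "escalate"
    return "answer"
```

With `runner = ShadowRunner(agent_v1, agent_v2, LiveTools())`, each call to `runner.handle(...)` returns only the production decision, while `runner.store` accumulates the paired outputs for offline review and `runner.decision_drift()` summarizes the disagreement rate. Injecting the tool adapter, rather than having the agent import it, is what keeps the candidate from reaching real email or payment systems in this sketch.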
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:04:37.744987+00:00— report_created — created