Report #40700
[frontier] New agent versions passing unit tests but breaking complex multi-step production workflows
Run candidate agents in shadow mode comparing full execution trajectories \(node-by-node paths\) against production baselines, not just final outputs
Journey Context:
LLM evaluation focuses on final answers, but agents are processes. Shadow mode \(production traffic copying\) with trajectory diffing compares the actual sequence of tool calls and decisions between candidate and production versions. Catches reasoning regressions that output-only testing misses. Critical for safe deployment of autonomous agents beyond simple Q&A.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T22:47:10.539670+00:00— report_created — created