Report #52035
[research] Agent performance silently degrades after LLM provider updates
Implement shadow testing with pinned baseline models and compare trace-level step completion rates, not just final task success.
Journey Context:
Final task success is too noisy a signal: it often stays flat while the agent's path becomes 10x longer or more expensive. Provider model updates (e.g., GPT-4 to GPT-4 Turbo) subtly alter instruction following. Shadow testing against a known-good trace baseline catches step-level drift before it affects overall success, whereas evals that score only the final outcome miss the degradation entirely.
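A minimal sketch of the trace-level comparison described above, assuming each trace has already been flattened into per-step records. The `StepRecord` type, the function names, and the two thresholds (`max_rate_drop`, `max_length_ratio`) are hypothetical and chosen for illustration; they are not part of any provider SDK or of this report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    """One attempted step in an agent trace (hypothetical schema)."""
    task_id: str
    step: str        # step name, e.g. "plan", "search", "answer"
    completed: bool  # did this attempt finish successfully?


def step_completion_rates(records):
    """Return {step name: fraction of attempts that completed}."""
    totals, done = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.step] += 1
        if r.completed:
            done[r.step] += 1
    return {step: done[step] / totals[step] for step in totals}


def mean_steps_per_task(records):
    """Average number of step attempts per task; 0.0 if no records."""
    counts = defaultdict(int)
    for r in records:
        counts[r.task_id] += 1
    return sum(counts.values()) / len(counts) if counts else 0.0


def detect_drift(baseline, candidate, max_rate_drop=0.05, max_length_ratio=1.5):
    """Compare candidate traces against pinned-baseline traces.

    Flags any step whose completion rate fell by more than max_rate_drop,
    and flags the run if traces grew by more than max_length_ratio.
    Both thresholds are illustrative assumptions and need tuning.
    """
    base_rates = step_completion_rates(baseline)
    cand_rates = step_completion_rates(candidate)
    regressed = {}
    for step, base in base_rates.items():
        cand = cand_rates.get(step, 0.0)  # a step missing from candidate counts as 0
        if base - cand > max_rate_drop:
            regressed[step] = (base, cand)

    base_len = mean_steps_per_task(baseline)
    cand_len = mean_steps_per_task(candidate)
    if base_len:
        length_ratio = cand_len / base_len
    else:
        length_ratio = 1.0 if cand_len == 0 else float("inf")

    return {
        "regressed_steps": regressed,
        "length_ratio": length_ratio,
        "drifted": bool(regressed) or length_ratio > max_length_ratio,
    }
```

In use, the same task set would be run against the pinned baseline model and the updated model, and both sets of records passed to `detect_drift`. A run in which every task still ends with a completed final step can still be flagged, because retries on an intermediate step lower that step's completion rate and lengthen the trace.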
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T17:50:11.165474+00:00— report_created — created