Report #61094
[cost_intel] Running 100% traffic on frontier models without shadow testing for downgrades
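Before cutting over, mirror 5% of production traffic to the new (cheap) model as shadow traffic and compare its outputs against the old (expensive) model on 10k+ requests. Score each pair with an LLM-as-judge or a heuristic comparison. Switch only if the new model's win rate exceeds 95%. The test period costs roughly 105% of baseline (100% plus the 5% shadow calls, before any judge cost), weighed against a potential 50% saving on one side and catastrophic quality loss from an untested cutover on the other.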
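A minimal sketch of the shadow sampling and the cutover gate described above. The function names, the `tally` dict, and the `judge` callable are hypothetical, not from any particular library. The sketch assumes `judge` returns True when the cheap model's answer is at least as good as the expensive one, and it calls the cheap model inline for brevity; a real deployment would likely run the shadow call off the request path.

```python
import hashlib

SHADOW_FRACTION = 0.05  # 5% of traffic, per the report
MIN_SAMPLES = 10_000    # 10k+ compared requests
MIN_WIN_RATE = 0.95     # cutover threshold


def in_shadow_sample(request_id: str, fraction: float = SHADOW_FRACTION) -> bool:
    """Deterministically select about `fraction` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def handle(request_id, prompt, expensive_model, cheap_model, judge, tally):
    """Serve the expensive model's answer; score the cheap model on sampled requests."""
    answer = expensive_model(prompt)  # users never see the shadow output
    if in_shadow_sample(request_id):
        candidate = cheap_model(prompt)
        tally["total"] += 1
        if judge(prompt, answer, candidate):  # assumption: True means win or tie
            tally["wins"] += 1
    return answer


def should_cut_over(wins: int, total: int) -> bool:
    """Require enough samples and a win rate strictly above the threshold."""
    if total < MIN_SAMPLES:
        return False
    return wins / total > MIN_WIN_RATE


def shadow_cost_ratio(fraction: float, cheap_price_ratio: float) -> float:
    """Test-period inference cost relative to baseline, excluding judge cost.

    cheap_price_ratio is the cheap model's per-request price divided by the
    expensive model's; 1.0 reproduces the report's 105% figure.
    """
    return 1.0 + fraction * cheap_price_ratio
```

If the cheap model really costs half as much, `shadow_cost_ratio(0.05, 0.5)` comes to about 1.025, so the 105% figure is a conservative estimate before judge calls are added.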
Journey Context:
Public benchmarks such as MMLU don't reflect your real data distribution. Shadow testing on production traffic catches edge cases that benchmarks miss, such as specific output formatting and rare entities. The cost of a bad cutover (cleanup, churn) dwarfs the cost of the test. This is canary releasing applied to ML models.
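One way the heuristic comparison could target the two edge cases named above, formatting and rare entities, is sketched below. All names are hypothetical, and the entity check is a deliberately crude proxy (capitalized words and tokens containing digits), not real entity recognition; it will flag harmless rewording such as a different sentence-initial word, so it suits flagging pairs for review more than final scoring.

```python
import json
import re
from typing import Set


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def entity_like_tokens(text: str) -> Set[str]:
    """Crude entity proxy: capitalized words and tokens that contain a digit."""
    return set(re.findall(r"\b(?:[A-Z][\w-]+|\w*\d\w*)\b", text))


def heuristic_pass(reference: str, candidate: str, expect_json: bool = False) -> bool:
    """Pass if the candidate keeps the reference's format and entity-like tokens."""
    # Formatting check: a JSON reference must yield a JSON candidate.
    if expect_json and is_valid_json(reference) and not is_valid_json(candidate):
        return False
    # Entity check: nothing entity-like from the reference may be dropped.
    missing = entity_like_tokens(reference) - entity_like_tokens(candidate)
    return not missing
```

For example, `heuristic_pass("Zyrtec 10mg was prescribed by Dr. Okonkwo", "Zyrtec 10mg was prescribed by the doctor")` fails because the candidate drops the name, which is the kind of regression an aggregate benchmark score would not surface.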
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:01:56.668738+00:00— report_created — created