Report #99015
[synthesis] Production LLM systems often route every request to the strongest available model, wasting cost and latency
Predict per-query capability requirements (reasoning, code generation, debugging, tool use) with a cheap classifier, then shortfall-match against model profiles to pick the cheapest model that meets the need. Decouple the router from specific model identities so the catalog can change without retraining.
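The shortfall-matching step can be sketched as follows. This is a minimal illustration, not the implementation from any specific system: the `ModelProfile` fields, the model names, the cost figures, the capability scores, and the definition of shortfall as the summed per-capability deficit are all assumptions made for the example.

```python
from dataclasses import dataclass

# Capability axes named in the report; the identifiers are hypothetical.
CAPABILITIES = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost: float   # hypothetical relative cost per request
    scores: dict  # capability name -> strength in [0, 1]


def shortfall(needs, profile):
    """Sum of per-capability deficits (assumed definition of shortfall)."""
    return sum(max(0.0, needs[c] - profile.scores.get(c, 0.0)) for c in CAPABILITIES)


def route(needs, catalog, tolerance=0.0):
    """Pick the cheapest model whose shortfall is within tolerance.

    If no model qualifies, fall back to the model with the smallest
    shortfall, breaking ties by cost.
    """
    eligible = [m for m in catalog if shortfall(needs, m) <= tolerance]
    if eligible:
        return min(eligible, key=lambda m: m.cost)
    return min(catalog, key=lambda m: (shortfall(needs, m), m.cost))


# Illustrative catalog; every number here is made up.
CATALOG = [
    ModelProfile("small", 1.0, {"reasoning": 0.4, "code_generation": 0.6,
                                "debugging": 0.4, "tool_use": 0.5}),
    ModelProfile("medium", 4.0, {"reasoning": 0.7, "code_generation": 0.8,
                                 "debugging": 0.7, "tool_use": 0.7}),
    ModelProfile("large", 15.0, {"reasoning": 0.95, "code_generation": 0.95,
                                 "debugging": 0.95, "tool_use": 0.95}),
]

easy = {"reasoning": 0.2, "code_generation": 0.5, "debugging": 0.1, "tool_use": 0.0}
hard = {"reasoning": 0.9, "code_generation": 0.7, "debugging": 0.9, "tool_use": 0.3}

print(route(easy, CATALOG).name)
print(route(hard, CATALOG).name)
```

Because `route` reads only the profiles passed in, adding or removing a catalog entry changes routing decisions without touching the classifier that produces `needs`, which is the decoupling the summary describes.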
Journey Context:
Microsoft's HyDRA paper describes GitHub Copilot's VS Code Chat auto-mode router. It uses a ModernBERT encoder with four sigmoid heads to score capability needs, then selects the cheapest model whose capabilities meet or exceed those needs. On SWE-Bench Verified, this reportedly delivers 54% cost savings at iso-quality (matched quality) versus always using the strongest model. Copilot also exposes explicit model choice in chat, which is consistent with the broader architectural shift from one-model-fits-all to heterogeneous pools with intelligent routing.
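To make the "four sigmoid heads" idea concrete, here is a toy sketch in plain Python. It is not the paper's code: the encoder is replaced by a hand-written 3-dimensional vector standing in for a pooled query embedding, and the weights and biases are hypothetical values chosen only for illustration. The point it shows is that each capability gets its own independent sigmoid, so several needs can be high at once, unlike a softmax over capabilities.

```python
import math

# One head per capability axis named in the report; identifiers are hypothetical.
HEADS = ("reasoning", "code_generation", "debugging", "tool_use")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def capability_scores(embedding, weights, biases):
    """Apply one independent linear + sigmoid head per capability."""
    scores = {}
    for head in HEADS:
        logit = sum(w * e for w, e in zip(weights[head], embedding)) + biases[head]
        scores[head] = sigmoid(logit)
    return scores


# Stand-in for a pooled encoder output (a real encoder would produce this).
embedding = [0.5, -1.0, 2.0]

# Made-up head parameters, one weight vector and one bias per head.
weights = {
    "reasoning": [1.0, 0.0, 1.0],
    "code_generation": [0.0, -1.0, 0.5],
    "debugging": [0.0, 0.0, 0.0],
    "tool_use": [-2.0, 1.0, -1.0],
}
biases = {"reasoning": 0.0, "code_generation": 0.0, "debugging": 0.0, "tool_use": 0.0}

scores = capability_scores(embedding, weights, biases)
for head in HEADS:
    print(f"{head}: {scores[head]:.3f}")
```

The resulting dictionary has the same shape as the per-query needs that a router would compare against model profiles.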
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:10:07.260694+00:00— report_created — created