Report #2635
[architecture] How do I route queries between a strong expensive model and a weak cheap model without losing quality?
Use a learned router that estimates, for each prompt, the probability that the strong model's answer would be preferred over the weak model's, and send the prompt to the strong model only when that probability exceeds a threshold calibrated on your own traffic. LMSYS RouteLLM's matrix-factorization router is a strong, lightweight default; simpler alternatives include a prompt classifier or rule-based heuristics.
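The routing decision itself is a single comparison. The sketch below is a minimal illustration, not RouteLLM's API: `route`, `toy_win_prob`, and the model names are hypothetical, and the toy scorer (prompt length as a proxy for difficulty) only stands in for a trained router.

```python
from typing import Callable


def route(
    prompt: str,
    win_prob: Callable[[str], float],
    threshold: float,
    strong: str = "strong-model",  # hypothetical model name
    weak: str = "weak-model",      # hypothetical model name
) -> str:
    """Return the name of the model that should answer this prompt."""
    p = win_prob(prompt)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"win probability out of range: {p}")
    # Pay for the strong model only when it is likely enough to win.
    return strong if p >= threshold else weak


def toy_win_prob(prompt: str) -> float:
    """Hypothetical stand-in for a learned router: longer prompts score higher."""
    return min(len(prompt.split()) / 50.0, 1.0)


if __name__ == "__main__":
    print(route("hi", toy_win_prob, threshold=0.5))
    print(route(" ".join(["word"] * 40), toy_win_prob, threshold=0.5))
```

Keeping the scorer behind a plain callable means a rule-based heuristic, a classifier, or a learned router can be swapped in without touching the calling code.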
Journey Context:
Always sending queries to GPT-4 wastes money; always sending them to a small model sacrifices quality. A router trained on human preference data can recover about 95% of the strong model's quality while cutting costs by up to 85% (figures reported for RouteLLM on public benchmarks; the savings vary widely by benchmark). The threshold sets the cost-quality tradeoff: raising it sends fewer queries to the strong model. Calibrate it on a sample of your real queries rather than a public benchmark, because routing performance depends on your query distribution.
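One simple way to calibrate is to score a sample of real queries with the router and pick the threshold that sends a chosen fraction of them to the strong model. The sketch below assumes you already have those scores; `calibrate_threshold` and `strong_fraction` are hypothetical names, and it assumes routing uses `score >= threshold`.

```python
import math
from typing import Sequence


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so roughly `strong_fraction` of the sample is routed
    to the strong model, assuming routing uses `score >= threshold`."""
    if not scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in [0, 1]")
    ranked = sorted(scores, reverse=True)
    k = int(round(strong_fraction * len(ranked)))
    if k == 0:
        return math.inf  # no query reaches the strong model
    # The k-th highest score is the lowest score still sent to the strong model.
    # Tied scores at the cutoff can push the routed share slightly above target.
    return ranked[k - 1]


if __name__ == "__main__":
    # Made-up router scores for ten sampled queries.
    sample = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 0.05]
    t = calibrate_threshold(sample, strong_fraction=0.3)
    routed = sum(s >= t for s in sample)
    print(f"threshold={t}, routed to strong: {routed}/{len(sample)}")
```

This fixes the spend, not the quality. After choosing a threshold, compare answer quality on the same sample at that setting before rolling it out.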
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T13:30:49.046206+00:00— report_created — created