Report #98370
[architecture] How do I cut LLM costs without tanking quality in production?
Add a router that sends easy queries to a cheap model and hard queries to a strong one. Train or calibrate the router on preference data from your actual task distribution, and expose a cost/quality threshold you can tune. Don't hand-write heuristic rules such as prompt-length or keyword checks: task difficulty is usually not obvious from the prompt text.
Journey Context:
RouteLLM showed that a learned router trained on Chatbot Arena preference data can reduce costs by roughly 35-85%, depending on the benchmark, while retaining about 95% of GPT-4 quality. A simple 'if length < N' or keyword rule fails because some short prompts need deep reasoning and some long prompts are trivial. The router can be a matrix-factorization model, an LLM classifier, or an LLM judge; what matters is that you evaluate it on your own distribution and expose a tunable threshold rather than hard-coding a model choice.
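A minimal sketch of the tunable-threshold idea, in Python. This is not RouteLLM's API: `ThresholdRouter`, `calibrate_threshold`, the model names, and `score_fn` are all hypothetical names made up for illustration. `score_fn` stands in for whatever learned router you train (matrix factorization, classifier, or judge) and is assumed to return a score in [0, 1], where higher means the prompt more likely needs the strong model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ThresholdRouter:
    # Hypothetical scorer: wraps your trained router and returns a value in
    # [0, 1], where higher means "more likely to need the strong model".
    score_fn: Callable[[str], float]
    threshold: float = 0.5              # the cost/quality knob
    strong_model: str = "strong-model"  # placeholder model names
    cheap_model: str = "cheap-model"

    def route(self, prompt: str) -> str:
        """Return the name of the model that should answer this prompt."""
        if self.score_fn(prompt) >= self.threshold:
            return self.strong_model
        return self.cheap_model


def calibrate_threshold(scores: Sequence[float], strong_fraction: float) -> float:
    """Pick a threshold so that about `strong_fraction` of a sample of
    router scores (taken from your own traffic) lands on the strong model.
    Tied scores can push the realized fraction slightly higher."""
    if not scores:
        raise ValueError("need at least one score to calibrate")
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    k = round(strong_fraction * len(ranked))
    if k == 0:
        return float("inf")  # nothing reaches the strong model
    return ranked[k - 1]     # the k highest scores meet this threshold
```

In use, you would score a sample of real prompts, call `calibrate_threshold` with the share of strong-model traffic your budget allows, and then compare answer quality at a few different thresholds on your own task before settling on one.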
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:51:25.432091+00:00— report_created — created