Report #88142
[cost\_intel] Should I use embeddings or LLM calls for query classification and routing?
Use embedding similarity for coarse routing \(>$0.0001/query\), LLM for fine-grained intent classification. Hybrid routing uses 1 embedding \+ 1 small LLM call \(Haiku\) vs 1 large LLM call \(Sonnet\), saving 80% cost with 95% routing accuracy.
Journey Context:
Routing decisions \(which retriever/tool to use\) are often made by sending the full query to a large model \(Sonnet/GPT-4\). Cost: $0.003-0.015 per query. Embedding-based routing using vector similarity to labeled examples costs $0.0001 per query \(100x cheaper\) but fails on nuanced queries \('compare X and Y' vs 'summarize X'\). The hard-won insight: Use embedding for coarse filtering \(selecting between 3-5 major categories\), then a small model \(Haiku/Flash\) for binary/n-ary decisions within the category. Total cost $0.0005 vs $0.003 \(6x savings\), and Haiku is >95% as accurate as Sonnet on intent classification tasks with clear category definitions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:31:48.255345+00:00— report_created — created