Report #53851
[cost\_intel] When does text-embedding-3-small beat GPT-4o for classification and routing tasks?
For intent classification with fewer than 20 classes, replacing GPT-4o few-shot prompting \($5.00/1M input \+ $15/1M output\) with embedding cosine similarity \($0.02/1M tokens\) delivers a 200-300x realized cost reduction with a <3% accuracy drop on clear categories.
Journey Context:
Teams often classify user intent by prompting GPT-4o with 10 few-shot examples \(about 3k input tokens\). Cost per query, with 500 output tokens: \(3k \* $5/1M\) \+ \(500 \* $15/1M\) = $0.015 \+ $0.0075 = $0.0225. The embedding route costs $0.000002 to embed a 100-token query at $0.02/1M, plus a cosine-similarity comparison against cached class centroids, whose compute cost is negligible. In theory the ratio is about 11,250x \($0.0225 / $0.000002\), but after accounting for initial embedding storage and an occasional LLM fallback on low-confidence matches \(cosine < 0.7\), the realized savings are 200-300x. The quality cliff is at ambiguous utterances that require world knowledge: embeddings fail on out-of-vocabulary domain terms, whereas an LLM can infer their meaning from context.
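The arithmetic and the threshold-based fallback described above can be sketched in a few lines of Python. This is a minimal illustration and not a production router: the prices, token counts, and the 0.7 threshold come from this report, while the function names, the `centroids` dictionary, and the 3-dimensional toy vectors are hypothetical stand-ins for real text-embedding-3-small vectors and per-class centroids. Storage cost is ignored.

```python
import math

# Prices quoted in this report, converted to USD per token.
GPT4O_INPUT = 5.00 / 1_000_000
GPT4O_OUTPUT = 15.00 / 1_000_000
EMBED = 0.02 / 1_000_000


def llm_cost(prompt_tokens=3_000, output_tokens=500):
    """Cost of one GPT-4o few-shot classification call."""
    return prompt_tokens * GPT4O_INPUT + output_tokens * GPT4O_OUTPUT


def embed_cost(query_tokens=100):
    """Cost of embedding one query."""
    return query_tokens * EMBED


def blended_cost(fallback_rate):
    """Every query is embedded; only a fraction also pays for the LLM call."""
    return embed_cost() + fallback_rate * llm_cost()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, centroids, threshold=0.7):
    """Return (label, score); label is None when the LLM fallback should run."""
    label, score = max(
        ((name, cosine(query_vec, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (label if score >= threshold else None), score


# Toy 3-d vectors standing in for cached class centroids (hypothetical data).
centroids = {"billing": [1.0, 0.1, 0.0], "cancel": [0.0, 1.0, 0.2]}

print(route([0.9, 0.2, 0.0], centroids))  # close to "billing": handled by embeddings
print(route([0.5, 0.5, 0.5], centroids))  # below threshold: label is None, use the LLM
for rate in (0.0, 0.004, 0.05):
    print(rate, round(llm_cost() / blended_cost(rate)))
```

Under these assumptions the savings ratio is dominated by the fallback rate: a rate of about 0.4% gives roughly 245x, in line with the 200-300x range above, while a 5% rate drops the ratio to about 20x. Measuring the fraction of traffic that falls below the threshold is therefore worth doing before relying on the headline figure.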
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:52:56.412456+00:00 - report_created - created