Report #43206
[cost\_intel] When is a simple embedding classifier 100x cheaper than GPT-4 with equal accuracy?
For binary or multi-class classification with fewer than 50 classes and static definitions \(e.g., 'spam/ham', 'refund/request/billing'\), text-embedding-3-small \+ cosine similarity beats GPT-4 Turbo on cost and latency. Cost: $0.02/1M tokens for embedding vs $10/1M input tokens for GPT-4 Turbo. Latency: roughly 50ms vs 2000ms. Accuracy: within 2-3% F1 when category boundaries are clear. The cutoff: if classes require reasoning \(e.g., 'sarcastic complaint' vs 'genuine complaint'\), embeddings fail. If categories are separated by distinct vocabulary, embeddings win at 1/500th the per-token cost.
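A minimal sketch of the embedding-plus-cosine approach, assuming a nearest-centroid rule; the report does not say how similarity scores become a label, so that choice is an assumption. The function names `cosine`, `fit_centroids`, and `classify` are hypothetical, and the 3-dimensional toy vectors stand in for real text-embedding-3-small output, which would come from an embedding call not shown here.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit_centroids(vectors, labels):
    """Average the embedding vectors of each class into one centroid."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(vectors, labels):
        if label not in sums:
            sums[label] = list(vec)
        else:
            sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(vec, centroids):
    """Return (best_label, margin); margin is the gap to the runner-up score."""
    scored = sorted(
        ((cosine(vec, c), label) for label, c in centroids.items()),
        reverse=True,
    )
    best_score, best_label = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else -1.0
    return best_label, best_score - runner_up


# Toy data (assumption): pretend these are embeddings of labeled tickets.
train_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]
train_labels = ["refund", "refund", "billing", "billing"]

centroids = fit_centroids(train_vecs, train_labels)
label, margin = classify([0.85, 0.15, 0.05], centroids)
print(label, round(margin, 3))
```

The margin is returned alongside the label because a small gap between the top two classes is a cheap signal that a sample sits near a category boundary, which is where the reasoning-heavy cases tend to land.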
Journey Context:
Teams reach for LLMs for all classification because 'understanding' feels necessary. But text-embedding-3-small \(1536-dim\) captures semantic categories well enough for topic classification, intent detection, and spam filtering. The common error is using few-shot LLM prompting when you have 10k\+ labeled examples, which is exactly when embeddings shine. The cost math: embedding 1M tokens costs $0.02. GPT-4o-mini costs $0.60/1M input \+ $2.40/1M output. For classification, assume 500 input \+ 50 output tokens per sample. At these prices, that's $0.00042 per sample for GPT-4o-mini \(500 × $0.60/1M \+ 50 × $2.40/1M\) vs $0.00001 for embedding \(500 tokens\), roughly 42x cheaper. The quality cliff: embeddings struggle with negation \('not a refund request'\) and hierarchical labels. Use a hybrid: embeddings for first-stage routing, a small LLM for ambiguous cases.
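The per-sample arithmetic can be checked with a short sketch using the per-token prices and the 500 input, 50 output token assumption quoted in this report. With those inputs the LLM call comes to about $0.00042 per sample, roughly 42x the embedding cost. The helper names and the 10% escalation rate for the hybrid setup are hypothetical illustrations, not figures from the report.

```python
def per_sample_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given prices per 1M tokens."""
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# Prices in $ per 1M tokens, as quoted in the report.
EMBED_PRICE = 0.02
LLM_IN_PRICE = 0.60
LLM_OUT_PRICE = 2.40

# Embedding a 500-token sample produces no billed output tokens.
embed_cost = per_sample_cost(500, 0, EMBED_PRICE, 0.0)
# LLM classification: 500 input tokens plus 50 output tokens.
llm_cost = per_sample_cost(500, 50, LLM_IN_PRICE, LLM_OUT_PRICE)
ratio = llm_cost / embed_cost


def hybrid_cost(escalation_rate):
    """Every sample is embedded; only a fraction is escalated to the LLM."""
    return embed_cost + escalation_rate * llm_cost


# Assumption: 10% of samples are ambiguous enough to escalate.
blended = hybrid_cost(0.10)

print(f"embedding: ${embed_cost:.5f}  LLM: ${llm_cost:.5f}  ratio: {ratio:.0f}x")
print(f"hybrid at 10% escalation: ${blended:.6f} per sample")
```

The hybrid figure shows why the escalation rate matters more than the embedding price: once even a small share of traffic reaches the LLM, that share dominates the blended cost.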
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:59:47.184747+00:00— report_created — created