Report #55298
[cost\_intel] Using GPT-4o for binary classification when 95% accuracy is achievable with embedding cosine similarity at 1/250th the cost
Use text-embedding-3-small cosine similarity with a tuned threshold \(0.78-0.82 typical\) for binary or three-class semantic classification; reserve LLM classification for tasks with more than 5 classes or where confidence calibration is critical.
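A minimal sketch of the threshold check, assuming the embeddings have already been fetched and that the positive class is represented by the centroid of a few labeled positive examples. The report does not say what each input is compared against, so the prototype approach, the helper names, and the toy 2-dimensional vectors are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors; used here as a class prototype (assumption)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(vector, prototype, threshold=0.80):
    """True if the vector is at least `threshold` similar to the positive-class prototype."""
    return cosine_similarity(vector, prototype) >= threshold


# Toy 2-d vectors standing in for real embeddings (hypothetical data).
positive_examples = [[1.0, 0.0], [0.9, 0.1]]
prototype = centroid(positive_examples)

print(classify([1.0, 0.1], prototype))  # close to the prototype
print(classify([0.0, 1.0], prototype))  # nearly orthogonal to the prototype
```

The default of 0.80 sits in the middle of the typical range quoted above; it is a starting point to tune, not a recommended constant.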
Journey Context:
Binary semantic classification \(spam/ham, toxic/safe, relevant/irrelevant\) is an embedding task disguised as an LLM task. OpenAI's text-embedding-3-small provides 0.95\+ correlation with GPT-4o on binary semantic tasks at $0.02 per 1M tokens, versus $5.00 per 1M tokens for GPT-4o: a 250x cost difference. The approach breaks down as class count grows: with more than 5 fine-grained categories or hierarchical labels, embedding k-NN collapses because the decision boundaries overlap. In practice, calibrate the threshold on a validation set \(usually 0.78-0.82 cosine similarity\) and send only the roughly 5% of edge cases with borderline embedding confidence \(0.65-0.75 range\) to the LLM.
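The calibration and routing steps might look like the sketch below. It assumes similarity scores have already been computed for a labeled validation set; the function names, the accuracy-based selection rule, and the sample scores are hypothetical, and the borderline band is the 0.65-0.75 range quoted in this report, exposed as a parameter.

```python
def calibrate_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest validation accuracy.

    scores: cosine similarities for the validation examples
    labels: True for the positive class, False otherwise
    Ties are resolved in favor of the earliest candidate.
    """
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        correct = sum((score >= threshold) == label
                      for score, label in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


def route(score, threshold, borderline=(0.65, 0.75)):
    """Send borderline scores to the LLM; decide the rest by threshold."""
    low, high = borderline
    if low <= score <= high:
        return "llm"
    return "positive" if score >= threshold else "negative"


# Hypothetical validation data, for illustration only.
val_scores = [0.91, 0.85, 0.80, 0.79, 0.70, 0.60]
val_labels = [True, True, True, False, False, False]
candidates = [0.78, 0.79, 0.80, 0.81, 0.82]

threshold, accuracy = calibrate_threshold(val_scores, val_labels, candidates)
print(threshold, accuracy)
print([route(s, threshold) for s in (0.90, 0.70, 0.50)])
```

Accuracy is only one possible selection criterion; precision or recall at a fixed operating point may suit spam or toxicity filtering better. Note also that with LLM fallback the blended cost is higher than the embedding-only figure: if every input is embedded and 5% are also sent to the LLM with similar token counts, the report's prices give about $0.02 + 0.05 × $5.00 = $0.27 per 1M tokens, roughly 18x cheaper than LLM-only.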
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:18:30.668055+00:00— report_created — created