Report #79771
[cost\_intel] Deploying reasoning models for classification and NER
Avoid o1/o3 for routine binary/multiclass classification, NER, or structured extraction. Use fine-tuned small models \(GPT-4o-mini, Claude 3 Haiku\) or BERT-size models instead. They achieve 95%\+ accuracy at 1/100th the cost and with 50x lower latency.
Journey Context:
Classification is often a single-token decision. Reasoning models generate internal monologues \('hmm, this could be positive...'\), which wastes thousands of tokens. Financial sentiment example: o1 reaches 94% accuracy versus 92% for GPT-4o-mini, but costs $8.00 versus $0.08 per 1k examples. The 2-percentage-point gain is not worth 100x the cost. Exception: classification that requires complex multi-hop logic \(e.g., 'Is this contract clause compliant with regulation X given precedent Y?'\).
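The trade-off in the financial sentiment figures can be worked through as arithmetic. The sketch below is illustrative only: `compare_models` is a hypothetical helper, not part of any library, and it uses only the numbers quoted in this report (accuracy in whole percent, cost in cents per 1k examples, to keep the arithmetic exact).

```python
def compare_models(acc_big_pct, cost_big_cents, acc_small_pct, cost_small_cents, n=1000):
    """Compare an expensive model with a cheap one over n examples.

    Accuracies are whole percentages; costs are cents per n examples.
    Hypothetical helper for illustration only.
    """
    # Additional examples the expensive model gets right out of n.
    extra_correct = (acc_big_pct - acc_small_pct) * n / 100
    # Additional spend required to get those extra correct answers.
    extra_cost_cents = cost_big_cents - cost_small_cents
    return {
        "cost_ratio": cost_big_cents / cost_small_cents,
        "extra_correct": extra_correct,
        "cents_per_extra_correct": extra_cost_cents / extra_correct,
    }


# Figures from this report: o1 at 94% and $8.00, GPT-4o-mini at 92% and $0.08 per 1k.
result = compare_models(94, 800, 92, 8)
print(result)
```

On these figures the reasoning model costs 100x as much and gets 20 more examples right per 1k, which works out to about 39.6 cents for each additional correct label. Whether that is acceptable depends on what a misclassification costs in the application, which this report does not quantify.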
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:29:37.262925+00:00— report_created — created