Report #77390
[cost_intel] Using GPT-4 for binary classification (spam/ham) burns $30/MTok when fine-tuned small models or GPT-3.5 achieve 98% accuracy at $0.50/MTok
Route classification tasks with outputs under 100 tokens to a fine-tuned small model (e.g. Llama-3.1-8B) or GPT-3.5. Reserve GPT-4 for tasks that need more than two reasoning steps or more than 8k tokens of context. Use a 'cascade' pattern: run the cheap model first and escalate to the expensive model only when the cheap model's confidence is below 0.9.
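The routing rule above can be written as a small function. This is only a sketch: the `Task` fields, the tier names, and the "cascade" fallback for tasks the report does not cover are assumptions made for illustration. Only the three thresholds come from the report.

```python
from dataclasses import dataclass

# Thresholds quoted in the report.
MAX_SMALL_OUTPUT_TOKENS = 100    # classification outputs below this go to the small model
MAX_SMALL_REASONING_STEPS = 2    # more steps than this goes to the large model
MAX_SMALL_CONTEXT_TOKENS = 8_000 # more context than this goes to the large model


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; the field names are not from the report."""
    is_classification: bool
    expected_output_tokens: int
    reasoning_steps: int
    context_tokens: int


def pick_tier(task: Task) -> str:
    """Return "small", "large", or "cascade" for a task."""
    # Deep reasoning or long context: reserve the large model (e.g. GPT-4).
    if task.reasoning_steps > MAX_SMALL_REASONING_STEPS:
        return "large"
    if task.context_tokens > MAX_SMALL_CONTEXT_TOKENS:
        return "large"
    # Short-output classification: fine-tuned small model or GPT-3.5.
    if task.is_classification and task.expected_output_tokens < MAX_SMALL_OUTPUT_TOKENS:
        return "small"
    # Assumption: anything else tries the cheap model first and escalates.
    return "cascade"
```

A spam/ham check such as `Task(True, 5, 1, 500)` lands on the small tier, while the same task with three reasoning steps is routed to the large one.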
Journey Context:
Binary and three-way classification is a largely solved problem for small models. GPT-4's advantage shows up in multi-hop reasoning, tool use, and long-context synthesis. For 'is this a refund request?' or 'sentiment: positive/negative/neutral', GPT-3.5 achieves >95% accuracy on most benchmarks at $0.50/MTok, versus $30/MTok for GPT-4, which makes GPT-3.5 60x cheaper. The failure mode of cheap models is edge cases with implicit negation ('not bad' -> positive). Handle these with few-shot prompting or a 1% sample human review, not by upgrading 100% of traffic to GPT-4. The 'cascade' pattern (cheap model first, expensive model only on low confidence) can reach about 99% accuracy at roughly 10% of the cost of sending everything to GPT-4.
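A minimal sketch of the cascade and its cost arithmetic follows. It assumes each model is wrapped as a callable that returns a `(label, confidence)` pair. The wrappers, the function names, and the way confidence is obtained are all hypothetical and not specified in the report. The price formula also assumes an escalated request uses the same number of tokens on both models.

```python
from typing import Callable, Tuple

# Hypothetical interface: a classifier takes text and returns (label, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def classify_with_cascade(
    text: str,
    cheap: Classifier,
    expensive: Classifier,
    threshold: float = 0.9,  # escalation threshold quoted in the report
) -> Tuple[str, str]:
    """Return (label, tier), where tier records which model answered."""
    label, confidence = cheap(text)
    if confidence >= threshold:
        return label, "cheap"
    # Low confidence: pay for the expensive model on this request only.
    label, _ = expensive(text)
    return label, "expensive"


def blended_price_per_mtok(
    cheap_price: float, expensive_price: float, escalation_rate: float
) -> float:
    """Average price when every request hits the cheap model and a fraction
    `escalation_rate` is also sent to the expensive model."""
    return cheap_price + escalation_rate * expensive_price
```

With the report's prices, `blended_price_per_mtok(0.50, 30.0, 0.083)` comes to about $2.99/MTok, so the "10% of the cost" figure corresponds to escalating roughly 8% of traffic. A higher escalation rate erodes the saving quickly, so the share of requests falling below the threshold is worth measuring on real traffic.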
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:30:06.834657+00:00— report_created — created