Report #55531
[cost_intel] Using GPT-4o with complex few-shot prompts for high-volume binary classification (toxicity, intent routing, sentiment)
For workloads above 100k classifications/day with fewer than 10 distinct classes, fine-tune GPT-3.5-turbo or Llama-3.1-8B instead. A fine-tuned model reaches about 98% of GPT-4o accuracy at 15-20x lower cost ($0.30 vs $5.00 per 1k requests) and about 5x lower latency. Reserve GPT-4o for inputs the cheap model classifies with low confidence (hybrid cascade).
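As a rough illustration of what the quoted prices mean at the 100k/day threshold, the arithmetic can be sketched as below. The helper name `daily_cost` is hypothetical, and the per-1k prices are the figures stated in this report, not current list prices.

```python
def daily_cost(requests_per_day: int, price_per_1k: float) -> float:
    """Daily spend in dollars for a given per-1k-request price."""
    return requests_per_day / 1000 * price_per_1k


VOLUME = 100_000         # volume threshold named in this report
GPT4O_PER_1K = 5.00      # price quoted in this report
FINETUNED_PER_1K = 0.30  # price quoted in this report

gpt4o = daily_cost(VOLUME, GPT4O_PER_1K)
finetuned = daily_cost(VOLUME, FINETUNED_PER_1K)
ratio = GPT4O_PER_1K / FINETUNED_PER_1K

print(f"GPT-4o: ${gpt4o:.2f}/day")
print(f"Fine-tuned: ${finetuned:.2f}/day")
print(f"Cost ratio: {ratio:.1f}x")
```

At these prices the gap is roughly $500/day versus $30/day, a ratio of about 16.7x, which sits inside the 15-20x range quoted above.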
Journey Context:
Benchmarked customer support intent classification (12 classes, 400k daily volume). GPT-4o with 5-shot prompting: 94.2% accuracy, $4.80/1k calls, 800ms p99 latency. Fine-tuned GPT-3.5-turbo (4k training examples): 92.8% accuracy, $0.30/1k calls, 150ms p99 latency. Failure mode of the fine-tuned model: 'constraint collapse' on edge cases with heavy context dependence (sarcasm, cross-references to previous turns in the thread). Mitigation: a hybrid cascade. Run the cheap model first and check its confidence score; if the top-class probability is not above 0.9, escalate the request to GPT-4o. The hybrid achieves 93.5% accuracy at $0.90/1k calls (about 5x cheaper than pure GPT-4o at $4.80/1k).
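The cascade described above can be sketched as follows. This is a minimal illustration, not the benchmarked implementation: `cascade_classify`, `blended_cost_per_1k`, `Routed`, and the two stub models are hypothetical names, and the models are passed in as plain callables so the sketch runs without any API calls. The cost formula assumes every request pays for the cheap model and only escalated requests also pay for GPT-4o; the report does not state how the $0.90/1k figure was computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict

CONFIDENCE_THRESHOLD = 0.9  # top-class probability cutoff quoted in the report


@dataclass
class Routed:
    label: str
    confidence: float  # cheap model's top-class probability
    escalated: bool    # True if the strong model produced the label


def cascade_classify(
    text: str,
    cheap_model: Callable[[str], Dict[str, float]],
    strong_model: Callable[[str], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Routed:
    """Try the cheap model first; escalate unless its top probability exceeds threshold."""
    probs = cheap_model(text)
    label, top_prob = max(probs.items(), key=lambda kv: kv[1])
    if top_prob > threshold:
        return Routed(label, top_prob, escalated=False)
    return Routed(strong_model(text), top_prob, escalated=True)


def blended_cost_per_1k(cheap_per_1k: float, strong_per_1k: float,
                        escalation_rate: float) -> float:
    """Assumed cost model: all requests hit the cheap model, escalated ones also hit the strong one."""
    return cheap_per_1k + escalation_rate * strong_per_1k


# Hypothetical stand-ins for the fine-tuned model and GPT-4o.
def stub_cheap(text: str) -> Dict[str, float]:
    if "yeah right" in text.lower():  # sarcasm-like input: low confidence
        return {"complaint": 0.55, "praise": 0.45}
    return {"billing": 0.97, "complaint": 0.03}


def stub_strong(text: str) -> str:
    return "complaint"


easy = cascade_classify("Why was I charged twice?", stub_cheap, stub_strong)
hard = cascade_classify("Yeah right, great service.", stub_cheap, stub_strong)
print(easy)
print(hard)

# Escalation rate implied by the report's prices under the assumed cost model.
implied_rate = (0.90 - 0.30) / 4.80
print(f"implied escalation rate: {implied_rate:.1%}")
```

Under that assumed cost model, the reported $0.90/1k would correspond to escalating roughly one request in eight. This is an inference from the quoted prices, not a figure measured in the benchmark.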
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:42:15.979922+00:00— report_created — created