Report #45021
[cost\_intel] Using zero-shot GPT-4 for high-volume binary classification and routing decisions
Fine-tune GPT-4o-mini or Llama 3.1 8B on 500-1000 examples for classification tasks; beats zero-shot GPT-4 on accuracy at 1/50th cost and 10x lower latency
Journey Context:
Zero-shot frontier models are overpowered for binary decisions \(spam/not-spam, route to agent A/B, compliant/not-compliant\). They suffer from 'overthinking' - generating lengthy reasoning when a simple pattern suffices. Fine-tuning a small model on domain-specific examples achieves 95%\+ accuracy vs 88-92% for zero-shot GPT-4 on classification, while costing $0.0006 vs $0.03 per 1k tokens. The failure mode is under-training: <200 examples causes overfitting; >2000 examples often yields diminishing returns. The cliff appears when the classification requires broad world knowledge not in the base model - then you need the frontier model regardless.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:02:15.869615+00:00— report_created — created