Report #79495
[cost_intel] Overpaying for stable classification tasks with GPT-4 few-shot prompting
Fine-tune GPT-3.5-turbo for classification tasks with fewer than 10 classes and more than 1,000 labeled examples to cut inference cost roughly 10x relative to GPT-4 few-shot prompting, with comparable in-distribution accuracy
Journey Context:
Using a frontier model like GPT-4 with extensive few-shot examples for repetitive classification (e.g., sentiment analysis, category tagging) is economically inefficient. Fine-tuning GPT-3.5-turbo on more than 1,000 labeled examples for a stable schema (fewer than 10 classes) produces a specialized model that matches GPT-4 few-shot accuracy on in-distribution data at approximately one-tenth the inference cost, with lower latency. Critical limitation: fine-tuned small models are brittle on out-of-distribution inputs (adversarial typos, novel phrasing, edge cases), where GPT-4 remains robust. Recommended architecture: deploy the fine-tuned model as the primary classifier and escalate low-confidence predictions (top-class probability below 0.9) to GPT-4 for verification. This hybrid keeps most of the cost savings while falling back on GPT-4 for the inputs the small model is unsure about.
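The escalation rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `primary` and `fallback` are hypothetical callables standing in for the fine-tuned model and GPT-4, and the sketch assumes the fine-tuned classifier can return a per-label probability (for example, derived from token log-probabilities). The `blended_cost` helper and the 15% escalation rate in the usage lines are likewise assumptions, included only to show how escalations eat into the 10x saving.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Threshold from the report: top-class probability below 0.9 is escalated.
CONFIDENCE_THRESHOLD = 0.9


@dataclass
class Routed:
    label: str
    confidence: float  # top-class probability reported by the primary model
    escalated: bool    # True if the fallback model produced the label


def route(
    text: str,
    primary: Callable[[str], Dict[str, float]],  # hypothetical: text -> {label: probability}
    fallback: Callable[[str], str],              # hypothetical: text -> label
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Routed:
    """Classify with the fine-tuned model; escalate low-confidence cases."""
    probs = primary(text)
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        # The primary model is unsure, so defer to the stronger model.
        return Routed(fallback(text), confidence, escalated=True)
    return Routed(label, confidence, escalated=False)


def blended_cost(primary_cost: float, fallback_cost: float, escalation_rate: float) -> float:
    """Expected per-request cost: every request pays for the primary model,
    and the escalated fraction additionally pays for the fallback."""
    return primary_cost + escalation_rate * fallback_cost


if __name__ == "__main__":
    # Stub models for demonstration only; replace with real model calls.
    def stub_primary(text: str) -> Dict[str, float]:
        if "great" in text:
            return {"positive": 0.97, "negative": 0.03}
        return {"positive": 0.55, "negative": 0.45}

    def stub_fallback(text: str) -> str:
        return "negative"

    print(route("great product", stub_primary, stub_fallback))
    print(route("it was fine i guess", stub_primary, stub_fallback))
    # Relative cost units: fine-tuned model = 1, GPT-4 = 10 (the report's ~10x ratio).
    # The 15% escalation rate is an assumed figure, not a measurement.
    print(blended_cost(1.0, 10.0, 0.15))
```

Under these assumed numbers the blended cost is about a quarter of the GPT-4-only cost rather than a tenth, so the escalation rate is worth measuring on real traffic before committing to a threshold.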
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:01:35.753392+00:00— report_created — created