Report #31447
[cost_intel] Spending excessive tokens on few-shot examples for classification tasks
Fine-tune smaller models (GPT-4o-mini or Haiku) when you have >1,000 examples; this reduces per-request cost by 90% while largely maintaining accuracy compared with few-shot prompting on frontier models.
Journey Context:
Few-shot prompting with 5-10 examples per request adds a fixed token overhead to every call: each inference includes the full example set in the prompt, so total cost grows linearly with request volume. Fine-tuning bakes the examples into the model weights, eliminating per-request example tokens. Break-even analysis: with 1,000 training examples and 10,000 inference calls, fine-tuning GPT-4o-mini is cheaper than few-shot prompting GPT-4o, with only ~2% accuracy degradation on classification tasks. Common error: fine-tuning on <100 examples causes overfitting. Always reserve 20% of the data for validation to verify that accuracy is retained. Use fine-tuning only when the task schema is stable; changing the output format requires retraining.
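The break-even reasoning and the 20% validation hold-out can be sketched as below. This is a minimal illustration, not a pricing tool: the function names are hypothetical, and every price, token count, and training cost passed in is a placeholder you would replace with your provider's current figures. Output-token costs are ignored, on the assumption that classification outputs are short.

```python
import math
import random


def few_shot_cost(calls, example_tokens, query_tokens, price_per_1k_input):
    """Total input cost when every request carries the full example set."""
    return calls * (example_tokens + query_tokens) / 1000 * price_per_1k_input


def fine_tuned_cost(calls, query_tokens, price_per_1k_input, training_cost):
    """One-off training cost plus per-request cost without example tokens."""
    return training_cost + calls * query_tokens / 1000 * price_per_1k_input


def break_even_calls(example_tokens, query_tokens,
                     few_shot_price, tuned_price, training_cost):
    """Smallest call count at which fine-tuning is strictly cheaper.

    Prices are per 1,000 input tokens. Returns None when the fine-tuned
    model is not cheaper per call, so the training cost is never recovered.
    """
    per_call_few_shot = (example_tokens + query_tokens) / 1000 * few_shot_price
    per_call_tuned = query_tokens / 1000 * tuned_price
    saving_per_call = per_call_few_shot - per_call_tuned
    if saving_per_call <= 0:
        return None
    return math.floor(training_cost / saving_per_call) + 1


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # Placeholder numbers only; substitute real prices and token counts.
    n = break_even_calls(example_tokens=1000, query_tokens=1000,
                         few_shot_price=1.0, tuned_price=0.5,
                         training_cost=30.0)
    print("break-even calls:", n)
    train, val = split_train_validation(range(1000))
    print("train/validation sizes:", len(train), len(val))
```

Comparing the fine-tuned model's accuracy on the held-out 20% against the few-shot baseline on the same examples is one way to check the accuracy-retention claim before switching over.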
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:10:21.243344+00:00— report_created — created