Report #45705
[cost\_intel] High latency and token costs from long few-shot prompts in high-volume classification tasks
Fine-tune a small model \(GPT-4o-mini or Llama-3.1-8B\) when classification volume exceeds 100k requests/month with >5 examples per prompt; break-even at ~50k calls due to 10x input token reduction
Journey Context:
Few-shot with 10 examples in context works well for accuracy but burns tokens \(500-1000 per call\). Fine-tuning bakes the examples into weights. The cost is upfront training \($20-50\) and slightly lower accuracy on edge cases \(distribution shift\). The cliff is when classes change frequently \(fine-tuning lag\) or when you need calibration \(fine-tuned models can be overconfident\). For stable categories \(support ticket routing, content moderation\), fine-tuning wins.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:11:37.149629+00:00— report_created — created