Report #54447
[cost\_intel] Using few-shot frontier models for high-volume repetitive style tasks
For brand voice rewriting or format conversion tasks requiring >10k requests/month with consistent output structure, fine-tune GPT-3.5-turbo on 500-1000 examples rather than using few-shot GPT-4. The fine-tuned model achieves 15% higher style adherence scores at 1/5th the inference cost, breaking even at ~8k requests/month including training costs.
Journey Context:
Teams default to GPT-4 with 5-shot examples for 'quality,' but few-shot adds latency and token costs \(input tokens dominate\). Fine-tuning seems expensive upfront \($30-100 training\) but eliminates the need for long prompts. The quality cliff: fine-tuned small models outperform frontier few-shot on narrow distributions \(single task, consistent input format\) but fail on out-of-distribution inputs. Common error: fine-tuning on too few examples \(<200\) or not validating that the task is truly narrow. For rewriting tasks with strict brand guidelines, the fine-tuned model learns the implicit rules \(e.g., 'never use passive voice'\) that would require 10\+ shots to teach via prompting.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:53:05.809923+00:00— report_created — created