Report #68716
[cost\_intel] Using expensive frontier models with complex prompting to enforce specific output formats or tones
Fine-tune a small model such as GPT-3.5-turbo or Haiku once you have more than 500 high-quality examples of the target style or format. On format adherence, fine-tuned small models beat zero-shot GPT-4 at roughly 1/20th the output cost \($0.003 vs $0.06 per 1k output tokens\). Measure success by win rate in blind human evaluation or by pass rate on an automated format checker.
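The automated format checker mentioned above can be very small. The following is a minimal sketch, assuming the target format is a note with four fixed section headings; the heading names and the functions `follows_format` and `pass_rate` are hypothetical and should be replaced with whatever your institutional template requires.

```python
import re

# Hypothetical template: headings a note must contain, in this order.
REQUIRED_SECTIONS = ["SUBJECTIVE:", "OBJECTIVE:", "ASSESSMENT:", "PLAN:"]


def follows_format(text: str) -> bool:
    """Return True if every required heading starts a line, in the required order."""
    position = 0
    for heading in REQUIRED_SECTIONS:
        pattern = re.compile(r"^" + re.escape(heading), re.MULTILINE)
        match = pattern.search(text, position)
        if match is None:
            return False
        # Continue searching after this heading so order is enforced.
        position = match.end()
    return True


def pass_rate(outputs: list[str]) -> float:
    """Fraction of model outputs that satisfy the format check."""
    if not outputs:
        return 0.0
    return sum(follows_format(o) for o in outputs) / len(outputs)
```

Running `pass_rate` over the same held-out inputs for the fine-tuned model and for the prompted frontier model gives a like-for-like adherence comparison. It only checks structure, so it complements blind human evaluation rather than replacing it.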
Journey Context:
Teams think fine-tuning is for 'custom knowledge'; in practice, it is cheapest for 'custom format.' The common mistake is using 500-word system prompts to make GPT-4 produce legal briefs or medical notes in exact institutional templates, which spends those prompt tokens on every call. Fine-tuning bakes the format into the weights, so the long prompt is no longer needed and inference becomes cheap and fast. Thresholds: around 500 examples is the cutoff; below that, stay with few-shot prompting. Above roughly 5k examples, you might need parameter-efficient fine-tuning \(LoRA\) on larger models. Watch for overfitting: if the fine-tuned model gets the format right but does not reflect the content of novel inputs, it has memorized the training examples.
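To make the cost argument concrete, here is a rough sketch of the arithmetic. The two output prices are the ones quoted in this report; the call volume, reply length, and the estimate of about 667 tokens for a 500-word system prompt are assumptions for illustration only. Input-token prices are not given in the report, so the prompt overhead is counted in tokens rather than dollars.

```python
# Output prices quoted in the report (USD per 1k output tokens).
FINE_TUNED_SMALL_PRICE = 0.003
FRONTIER_PRICE = 0.06


def output_cost(output_tokens: int, price_per_1k: float) -> float:
    """Dollar cost of generating `output_tokens` at a per-1k-token price."""
    return output_tokens / 1000 * price_per_1k


def prompt_overhead_tokens(system_prompt_tokens: int, calls: int) -> int:
    """Input tokens spent re-sending the same system prompt on every call."""
    return system_prompt_tokens * calls


# Hypothetical workload; replace with your own numbers.
calls = 10_000
tokens_per_reply = 800
system_prompt_tokens = 667  # assumed size of a 500-word prompt

frontier_total = calls * output_cost(tokens_per_reply, FRONTIER_PRICE)
small_total = calls * output_cost(tokens_per_reply, FINE_TUNED_SMALL_PRICE)
overhead = prompt_overhead_tokens(system_prompt_tokens, calls)

print(f"frontier output cost:   ${frontier_total:.2f}")
print(f"fine-tuned output cost: ${small_total:.2f}")
print(f"system-prompt tokens re-sent: {overhead:,}")
```

Under these assumptions the output cost comes to about $480 versus $24, and the prompted setup additionally re-sends several million system-prompt tokens that the fine-tuned model would not need. One-time training cost is not included and should be weighed against the per-call savings.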
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:49:18.093793+00:00— report_created — created