Report #64255
[cost_intel] Fine-tuning vs prompting cost break-even for strict output formatting tasks
Fine-tune GPT-3.5-turbo for structured generation tasks (SQL, specific JSON schemas) once you have more than 1,000 high-quality examples. On format adherence, fine-tuned GPT-3.5-turbo beats prompted GPT-4-turbo at one tenth of the cost ($0.003 vs. $0.03 per 1K tokens), but it fails on out-of-distribution inputs where GPT-4 generalizes.
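To make the cost side concrete, here is a minimal sketch of the arithmetic, using only the two per-1K-token prices quoted in this report. The monthly token volume and the upfront fine-tuning cost are hypothetical placeholders, not figures from the report, and actual pricing should be checked before relying on the result.

```python
# Per-1K-token prices as quoted in this report (USD); verify current pricing.
FINE_TUNED_35_PER_1K = 0.003
GPT4_TURBO_PER_1K = 0.03


def inference_cost(tokens: int, price_per_1k: float) -> float:
    """Cost in USD of processing `tokens` tokens at a per-1K-token price."""
    return tokens / 1000 * price_per_1k


def break_even_tokens(upfront_cost: float) -> float:
    """Token volume at which inference savings cover a one-off upfront cost."""
    saving_per_1k = GPT4_TURBO_PER_1K - FINE_TUNED_35_PER_1K
    return upfront_cost / saving_per_1k * 1000


# Hypothetical workload and upfront cost (assumptions, not from the report).
monthly_tokens = 10_000_000
upfront_cost = 500.0  # e.g. training run plus data labelling

print(f"GPT-4-turbo prompting: ${inference_cost(monthly_tokens, GPT4_TURBO_PER_1K):.2f}/month")
print(f"Fine-tuned 3.5-turbo:  ${inference_cost(monthly_tokens, FINE_TUNED_35_PER_1K):.2f}/month")
print(f"Break-even volume:     {break_even_tokens(upfront_cost):,.0f} tokens")
```

Under these assumed numbers the upfront cost is recovered after roughly 18.5 million tokens of traffic. Note that this is a spend break-even only; it is separate from the 1,000-example data threshold, which concerns output quality.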
Journey Context:
Teams often assume larger models are always better at formatting, but a fine-tuned smaller model learns the exact output distribution and rarely produces schema violations. The break-even is around 1,000 examples; below that, few-shot prompting with GPT-4 is more robust. Quality degradation signature: the fine-tuned small model loses flexibility on edge cases missing from its training data and produces 'confident nonsense' on novel inputs, whereas GPT-4 asks clarifying questions or admits uncertainty.
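One way to compare the two approaches on format adherence is to measure the share of outputs that parse and contain the expected fields. The sketch below assumes a JSON output with a hypothetical pair of required keys (`sql`, `dialect`); the function names and sample outputs are illustrative, not part of the report. It only detects schema violations: a well-formed but wrong answer (the 'confident nonsense' case) passes this check and needs a separate semantic evaluation.

```python
import json


def is_valid_output(raw: str, required_keys: set[str]) -> bool:
    """True if `raw` is a JSON object containing every required key."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj.keys())


def adherence_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that satisfy the format check (0.0 if empty)."""
    if not outputs:
        return 0.0
    valid = sum(is_valid_output(o, required_keys) for o in outputs)
    return valid / len(outputs)


# Hypothetical model outputs for a text-to-SQL task.
sample_outputs = [
    '{"sql": "SELECT 1", "dialect": "postgres"}',  # valid
    '{"sql": "SELECT 1"}',                         # missing key
    "Sure! Here is the query: SELECT 1",           # not JSON
]
rate = adherence_rate(sample_outputs, {"sql", "dialect"})
print(f"Format adherence: {rate:.0%}")
```

Running the same check on in-distribution and held-out inputs separately would show whether adherence holds up outside the training data.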
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:20:35.484058+00:00— report_created — created