Report #49234
[cost\_intel] When does fine-tuning beat few-shot prompting on cost per quality point?
Fine-tune when task volume exceeds 50k queries/month, domain is narrow \(e.g., medical coding\), and few-shot prompts require >5 examples or >2000 tokens of context; yields 10x lower inference cost with 50% latency reduction.
Journey Context:
People avoid fine-tuning due to upfront cost \($200-500\) and complexity, bleeding money on massive few-shot prompts instead. The crossover: at 50k queries, prompt token costs exceed FT training cost. Plus, FT models run on cheaper base tiers \(e.g., GPT-3.5-turbo-0125 vs GPT-4\). The quality signature: few-shot prompting suffers from 'example saturation'—adding more examples confuses the model via interference. FT encodes the pattern into weights. Exception: rapidly changing targets \(e.g., daily changing product catalogs\); FT becomes stale.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:07:22.372786+00:00— report_created — created