Report #79505
[cost_intel] Fine-tuning GPT-4o-mini is always cheaper than few-shot GPT-4o for specialized tasks
Fine-tune GPT-4o-mini only when all three conditions hold: the task has more than 10k labeled examples, it requires latency under 500 ms, and its schema stays stable for 3+ months. Otherwise, dynamic few-shot GPT-4o with cached examples wins on both cost and adaptability.
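The rule above can be written as a small check. This is a minimal sketch, not a definitive policy: the `TaskProfile` type, its field names, and `should_fine_tune` are hypothetical names introduced for illustration, and the thresholds are simply the ones stated in this report.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical description of a task; field names are illustrative."""
    labeled_examples: int        # count of quality labeled examples available
    latency_budget_ms: float     # required latency ceiling in milliseconds
    schema_stable_months: float  # how long the input/output schema is expected to hold


def should_fine_tune(task: TaskProfile) -> bool:
    """Return True only when all three conditions from the report hold."""
    return (
        task.labeled_examples > 10_000        # enough data to beat few-shot
        and task.latency_budget_ms < 500      # a real-time latency requirement
        and task.schema_stable_months >= 3    # stable schema for 3+ months
    )


if __name__ == "__main__":
    stable_realtime = TaskProfile(25_000, 300, 12)
    shifting_docs = TaskProfile(25_000, 300, 0.25)
    print(should_fine_tune(stable_realtime))  # all three conditions met
    print(should_fine_tune(shifting_docs))    # schema changes too often
```

Any single failed condition sends the task to few-shot GPT-4o, which matches the "otherwise" branch of the recommendation.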
Journey Context:
Math trap: fine-tuned GPT-4o-mini inference costs $0.60 per 1M tokens versus $5 per 1M tokens for GPT-4o, roughly 8x cheaper per token. The break-even calculation, however, must also account for training cost ($30-100k), maintenance overhead, and rigidity. Fine-tuning needs 10k+ quality examples to beat few-shot performance; with less data, the model tends to overfit. Hidden costs: (1) Data drift: when the upstream format changes, retraining costs $30k+ and takes days, whereas updating 5 few-shot examples takes minutes. (2) Capability lock-in: fine-tuned GPT-4o-mini cannot handle edge cases that GPT-4o manages easily, so it requires expensive fallback logic. (3) Evaluation cost: maintaining regression tests for fine-tuned models is engineering-heavy. Break-even volume is above 100M tokens/month on a perfectly stable task (e.g., medical entity extraction from a fixed EHR format). For dynamic tasks (e.g., extracting from ever-changing API docs), few-shot GPT-4o wins because the context changes weekly. Latency matters too: fine-tuned GPT-4o-mini is 2x faster than few-shot GPT-4o, which is critical for real-time features that require p99 latency under 500 ms.
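The break-even reasoning can be checked with a few lines of arithmetic. The sketch below uses the per-token prices and the $30k lower-bound training cost quoted in this report; everything else is an assumption added for illustration: the function names, the `prompt_overhead` multiplier (how many times more tokens a few-shot prompt consumes than a fine-tuned one), and the `retrains_per_year` drift parameter. It ignores maintenance and evaluation effort, so treat the result as an optimistic estimate.

```python
import math


def monthly_savings(tokens_m: float,
                    few_shot_price: float = 5.0,    # $ per 1M tokens (report figure)
                    fine_tuned_price: float = 0.6,  # $ per 1M tokens (report figure)
                    prompt_overhead: float = 1.0) -> float:
    """Inference dollars saved per month by the fine-tuned model.

    tokens_m is the fine-tuned model's monthly volume in millions of tokens.
    prompt_overhead is an assumed multiplier for the extra few-shot example
    tokens the GPT-4o prompt would carry (1.0 means no extra tokens).
    """
    return tokens_m * (few_shot_price * prompt_overhead - fine_tuned_price)


def payback_months(training_cost: float,
                   tokens_m: float,
                   retrains_per_year: float = 0.0,
                   **price_kwargs: float) -> float:
    """Months until the initial training cost is recovered.

    Retraining caused by drift is modeled (as an assumption) as repeating the
    full training cost, amortized monthly. Returns math.inf when the savings
    never cover the recurring retraining cost.
    """
    net = monthly_savings(tokens_m, **price_kwargs) - training_cost * retrains_per_year / 12
    return training_cost / net if net > 0 else math.inf


if __name__ == "__main__":
    print(monthly_savings(100))                            # stable task, 100M tokens/month
    print(payback_months(30_000, 100))                     # no retraining
    print(payback_months(30_000, 100, retrains_per_year=1))  # one retrain per year
```

With the report's prices and no extra prompt overhead, 100M tokens/month saves about $440 per month on inference, so the one-time training cost and any retraining dominate the comparison at moderate volume. Raising `prompt_overhead` or the monthly volume shortens the payback period; a single retrain per year at $30k pushes it to infinity in this model.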
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:02:36.085074+00:00— report_created — created