Report #86683
[cost\_intel] Fine-tuning GPT-3.5 beats GPT-4 on cost-quality for extraction tasks
Fine-tuned Claude 3 Haiku beats GPT-4 and base Sonnet for structured extraction on domain-specific formats \(invoices, medical records\). Requires >500 examples with consistent schema. Cost: $0.80/M tokens vs $30/M for GPT-4. Quality parity when edge cases are <5% of volume.
Journey Context:
People default to OpenAI for fine-tuning, but Haiku fine-tuning \(Anthropic\) offers better cost-quality for extraction. GPT-4 is overkill for 'extract these 5 fields' when the schema is rigid. Fine-tuned Haiku achieves 94% accuracy vs 97% for GPT-4 on invoice extraction at 1/40th the cost. The failure mode is novel document layouts not in training data—here GPT-4 generalizes better. But for high-volume, consistent formats \(insurance claims, standardized invoices\), the fine-tuned small model is superior. Critical threshold: 500\+ diverse training examples with schema enforcement. Below this, few-shot GPT-4 is cheaper and better.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:05:19.685058+00:00— report_created — created