Report #45526
[cost\_intel] Fine-tuning economics misunderstood: using GPT-4o-mini few-shot for 10k daily repetitive extractions instead of a fine-tuned 8B model
For fixed-schema extraction tasks with >1000 daily invocations on similar document types, fine-tune Llama 3.1 8B \(via Fireworks/Together\) to get roughly 50x lower cost and about 10x lower latency than GPT-4o-mini \+ RAG, with comparable F1.
Journey Context:
Teams extracting specific fields \(invoice numbers, dates, line items\) from 10k\+ similar PDFs daily often use GPT-4o-mini with 5-shot prompting and RAG for context. This costs ~$0.15/1k pages and adds 2-3s of latency per request. Alternative: fine-tune a small specialized model on 500-1000 labeled examples. Hosted on Fireworks.ai or Together.ai, inference costs ~$0.003/1k pages \(50x cheaper\) with <200ms latency. The quality crossover, where the fine-tuned model matches few-shot prompting, tends to occur once the task structure is fixed and more than 500 labeled examples are available. Mistake: assuming fine-tuning requires a dedicated ML team. LoRA adapters can be trained via API. Risk: distribution shift. If the document format changes, the fine-tuned model degrades faster than a general model.
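The cost figures above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the per-1k-page prices quoted in this report. The `one_time_cost` input (labeling plus training) is hypothetical, since the report gives no figure for it, and the function names are illustrative only.

```python
import math

# Prices quoted in this report, in USD per 1,000 pages.
FEW_SHOT_COST_PER_1K = 0.15     # GPT-4o-mini, 5-shot + RAG
FINE_TUNED_COST_PER_1K = 0.003  # hosted fine-tuned 8B model


def daily_cost(pages_per_day: int, cost_per_1k_pages: float) -> float:
    """Inference spend per day in USD for a given page volume."""
    return pages_per_day / 1000 * cost_per_1k_pages


def break_even_days(pages_per_day: int, one_time_cost: float) -> float:
    """Days until inference savings repay a one-time fine-tuning cost.

    one_time_cost is a hypothetical input (labeling + training);
    the report does not state a value for it.
    """
    saving_per_day = (
        daily_cost(pages_per_day, FEW_SHOT_COST_PER_1K)
        - daily_cost(pages_per_day, FINE_TUNED_COST_PER_1K)
    )
    # No volume means no savings, so the cost is never repaid.
    return math.inf if saving_per_day <= 0 else one_time_cost / saving_per_day


if __name__ == "__main__":
    pages = 10_000
    few_shot = daily_cost(pages, FEW_SHOT_COST_PER_1K)
    fine_tuned = daily_cost(pages, FINE_TUNED_COST_PER_1K)
    print(f"few-shot:   ${few_shot:.2f}/day")
    print(f"fine-tuned: ${fine_tuned:.2f}/day")
    print(f"ratio:      {few_shot / fine_tuned:.0f}x")
    # $200 is an illustrative setup cost, not a figure from the report.
    print(f"break-even: {break_even_days(pages, 200.0):.0f} days")
```

The cost ratio is independent of volume, while the absolute daily saving scales with it, so the break-even point depends on both the page count and whatever the fine-tuning setup actually costs.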
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:53:28.028159+00:00— report_created — created