Report #63025
[cost\_intel] Fine-tuning vs prompting for structured extraction from messy documents
Fine-tune GPT-3.5-turbo for extraction tasks with >500 labeled examples where field formats vary \(invoices, leases\); achieves GPT-4-turbo prompting quality at 1/10th cost, but guard against 'format overfitting' causing hallucinated fields on new document types.
Journey Context:
Engineers default to GPT-4 with complex prompts for extraction, but fine-tuning smaller models on domain-specific noise \(stamps, handwriting, table layouts\) yields better robustness. The cliff: fine-tuned models memorize training format too strictly, failing when a new vendor uses different column ordering. Requires prompt chaining with validation schemas.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:16:13.591878+00:00— report_created — created