Report #55895
[cost\_intel] When do reasoning models underperform cheap instruct models for structured data extraction?
Use GPT-4o-mini or a fine-tuned small model \(around 7B parameters\) for flat-schema extraction \(fewer than 3 nesting levels\); reserve o3 for deeply nested, conditional schemas that require calculation \(e.g., 'if field X > 100, derive field B as X\*0.15'\).
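The routing rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the tier labels, `nesting_depth`, and `pick_tier` are hypothetical names, only objects (dicts) are counted as nesting levels, and either deep nesting or a derived field is treated as enough to escalate, which is one possible reading of the rule.

```python
from typing import Any

# Hypothetical tier labels; map them to whichever models you actually deploy.
CHEAP_TIER = "cheap-instruct"
REASONING_TIER = "reasoning"


def nesting_depth(schema: Any) -> int:
    """Count nested object levels in a schema-like structure.

    Assumption: only dicts add a level; lists are traversed without adding one.
    """
    if isinstance(schema, dict):
        return 1 + max((nesting_depth(v) for v in schema.values()), default=0)
    if isinstance(schema, list):
        return max((nesting_depth(v) for v in schema), default=0)
    return 0


def pick_tier(schema: dict, has_derived_fields: bool, max_flat_depth: int = 2) -> str:
    """Escalate when the schema is deep or any field must be calculated."""
    if has_derived_fields or nesting_depth(schema) > max_flat_depth:
        return REASONING_TIER
    return CHEAP_TIER


flat_schema = {"invoice_id": "string", "items": [{"sku": "string"}]}
deep_schema = {"order": {"billing": {"tax": {"rate": "number"}}}}

print(pick_tier(flat_schema, has_derived_fields=False))
print(pick_tier(deep_schema, has_derived_fields=False))
```

The default `max_flat_depth=2` mirrors the "fewer than 3 nesting levels" threshold; whether a derived field alone justifies the more expensive tier is a judgment call to verify against your own data.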
Journey Context:
Reasoning models are prone to 'helpfulness hallucination' in extraction: they add fields that 'should' exist according to world knowledge, which violates strict schema constraints. On the FUNSD or DeepForm datasets, GPT-4o-mini achieves higher exact-match JSON scores than o3 because o3 reorders keys, adds descriptive comments, or infers values that are not in the text. The cost delta is 50-100x, with o3 the more expensive option. The exception is extraction that requires arithmetic or logical deduction across fields \(e.g., calculating totals from line items with tax rules\).
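The failure modes described above (extra fields, embedded comments, values not present in the text) can be caught with a strict post-check before scoring or storing an extraction. The following is a minimal sketch using only the Python standard library; `check_extraction` and its parameters are hypothetical names, and the grounding test is a naive substring match that only covers string values.

```python
import json


def check_extraction(raw_output: str, allowed_keys: set[str], source_text: str) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # Comments or trailing prose in the output end up here.
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for key in sorted(set(data) - allowed_keys):
        problems.append(f"unexpected field: {key}")
    for key in sorted(allowed_keys - set(data)):
        problems.append(f"missing field: {key}")
    for key, value in data.items():
        # Naive grounding check (assumption): string values must appear verbatim.
        if key in allowed_keys and isinstance(value, str) and value and value not in source_text:
            problems.append(f"value not found in source: {key}")
    return problems


source = "Invoice INV-7 from Acme Corp"
schema_keys = {"invoice_id", "vendor"}

clean = '{"vendor": "Acme Corp", "invoice_id": "INV-7"}'
padded = '{"invoice_id": "INV-7", "vendor": "Acme Corp", "currency": "USD"}'

print(check_extraction(clean, schema_keys, source))
print(check_extraction(padded, schema_keys, source))
```

Because the check works on the parsed object, key order does not count against the output; a benchmark that compares raw JSON strings would still penalize reordering.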
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T00:18:43.208188+00:00— report_created — created