Report #88119
[cost_intel] Overpaying for reasoning models on structured data extraction
Use GPT-4o with OpenAI Structured Outputs or Claude 3 Haiku for invoice/resume parsing with strict Pydantic schemas; reserve o1 for extraction that requires cross-field arithmetic (e.g., 'calculate total from line items and flag tax inconsistencies') or contradiction detection across documents.
Journey Context:
Benchmarks on VAT invoice parsing show GPT-4o with structured outputs achieving 96% F1 vs o1's 97%, but o1 costs 10x more and has 5x the latency. The failure modes differ: GPT-4o hallucinates format on messy scans; o1 handles messy scans but over-thinks simple tables. The rule is 'schema rigidity': if the output maps 1:1 to visible text with no math or inference, cheap instruct models win. If extraction requires implicit calculation, temporal reasoning ('this date is after that date'), or contradiction detection across chunks, the reasoning model justifies its cost. The anti-pattern is routing all document extraction through o1 'for accuracy' on simple forms, which wastes budget on deterministic transformations better handled by regex + 4o.
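The 'schema rigidity' rule can be sketched as a small router. This is a minimal illustration, not a tested production policy: `ExtractionTask`, `Tier`, `route`, and `blended_cost` are hypothetical names, the three flags are assumed to be set by whoever defines the extraction job, and the default `cost_ratio` of 10 is simply the 10x figure quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap-instruct"  # e.g. GPT-4o with structured outputs, Claude 3 Haiku
    REASONING = "reasoning"   # e.g. o1


@dataclass(frozen=True)
class ExtractionTask:
    """Hypothetical description of one extraction job (field names are illustrative)."""
    needs_arithmetic: bool = False          # e.g. recompute a total from line items
    needs_temporal_reasoning: bool = False  # e.g. 'this date is after that date'
    needs_cross_chunk_checks: bool = False  # e.g. contradictions across documents


def route(task: ExtractionTask) -> Tier:
    """Send a task to the reasoning tier only if it needs math or inference."""
    if (
        task.needs_arithmetic
        or task.needs_temporal_reasoning
        or task.needs_cross_chunk_checks
    ):
        return Tier.REASONING
    # Output maps 1:1 to visible text: the cheap instruct model is enough.
    return Tier.CHEAP


def blended_cost(reasoning_share: float, cost_ratio: float = 10.0) -> float:
    """Average cost per document, in units of one cheap-model call.

    reasoning_share is the fraction of documents routed to the reasoning tier;
    cost_ratio is how many times more a reasoning call costs (assumed 10x here).
    """
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must be between 0 and 1")
    return (1.0 - reasoning_share) + reasoning_share * cost_ratio
```

As a usage example, `route(ExtractionTask(needs_arithmetic=True))` selects the reasoning tier, while a plain `ExtractionTask()` stays on the cheap tier. Under the assumed 10x ratio, routing a hypothetical 10% of documents to the reasoning tier gives `blended_cost(0.1)` of about 1.9 cheap-call units per document, versus 10 units when everything goes through o1.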
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:29:43.773317+00:00— report_created — created