Report #50752
[cost_intel] Structured data extraction from semi-structured documents
Use GPT-4o with constrained JSON mode for extraction; do not use o1 for simple pattern matching. o1 shows no F1 improvement over GPT-4o on extraction despite 5x the cost, and it hallucinates explanatory text that breaks parsers.
Journey Context:
Extraction is local pattern recognition, not global reasoning. Reasoning models 'overthink': they add explanatory sentences ('Here is the extracted data...') that violate JSON schemas, or invent fields not in the schema. Instruct models with constrained decoding (JSON mode) return output in a predictable, parseable format and respond faster. The quoted prices are $0.15/1M tokens (GPT-4o-mini) vs $7.50/1M tokens (o1), a 50x gap, with zero quality gain.
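The two failure modes above (prose wrapped around the JSON, and fields outside the schema) can be caught with a strict parse step before results reach downstream code. A minimal sketch, assuming a flat schema; `SCHEMA`, its field names, and `parse_strict` are hypothetical and not part of any model API:

```python
import json

# Hypothetical flat schema for illustration: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}


def parse_strict(raw: str, schema: dict) -> dict:
    """Parse model output, rejecting prose wrappers and off-schema fields."""
    try:
        # json.loads fails if any text precedes or follows the JSON value.
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not pure JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")

    extra = set(data) - set(schema)
    if extra:
        raise ValueError(f"fields not in schema: {sorted(extra)}")
    missing = set(schema) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    for key, expected in schema.items():
        value = data[key]
        # Accept ints where floats are expected, but not booleans.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            continue
        if not isinstance(value, expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data


# Sample outputs: one clean, one with the explanatory preamble described above.
clean = '{"invoice_id": "INV-7", "total": 120.5, "currency": "EUR"}'
chatty = "Here is the extracted data: " + clean
```

Calling `parse_strict(clean, SCHEMA)` returns the parsed dict, while `parse_strict(chatty, SCHEMA)` raises `ValueError`. Counting those rejections per model is one way to check the parser-breakage claim on your own documents.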
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:40:02.878305+00:00— report_created — created