Report #52558
[cost_intel] Using o1 for structured data extraction from semi-structured PDFs
Use GPT-4o with constrained JSON mode or fine-tuned 4o-mini for schema-compliant extraction; use o1 only when extraction requires implicit relationship inference not explicit in text.
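As a rough illustration of this routing rule, the sketch below picks a model tier from a schema description. The FieldSpec class, the requires_inference flag, and the returned labels are hypothetical names introduced here for illustration; the labels are not exact API model IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One schema field. `requires_inference` is a hypothetical flag set by
    whoever designs the schema, marking values not stated explicitly in the text."""
    name: str
    requires_inference: bool = False


def choose_model(fields, fine_tuned_mini_available=False):
    """Return an illustrative model label for an extraction job."""
    # Escalate to the reasoning model only if some field needs implicit inference.
    if any(f.requires_inference for f in fields):
        return "o1"
    # Otherwise the job is plain parsing: prefer the cheapest compliant option.
    if fine_tuned_mini_available:
        return "fine-tuned 4o-mini"
    return "gpt-4o"


if __name__ == "__main__":
    invoice = [FieldSpec("invoice_number"), FieldSpec("total")]
    contract = [FieldSpec("party"), FieldSpec("implied_due_date", requires_inference=True)]
    print(choose_model(invoice))
    print(choose_model(contract))
```

Flagging inference at the field level keeps the expensive model confined to the schemas that actually need it.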
Journey Context:
Structured extraction is a format-compliance (parsing) task, not a reasoning task. GPT-4o with response_format={'type': 'json_object'} achieves 98% schema compliance on standard extraction benchmarks at $3/1M tokens. o1 costs $60/1M tokens, 20x as much, and adds 10-30s of latency without improving compliance, because the task is deterministic parsing rather than reasoning. The main failure mode of the cheaper models is hallucinating values for missing fields, which is better handled by validation logic (e.g., Pydantic) than by a reasoning model. Deploy o1 for extraction only when schema field values must be inferred (e.g., 'extract implied due date from context of contract terms' or 'determine sentiment of clause requiring statutory interpretation').
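The validation step mentioned above might look something like the following minimal sketch. It uses only the standard library as a stand-in for Pydantic so it runs without dependencies; the same type checks could be expressed as a Pydantic model. The schema (invoice_number, total, due_date) and the verbatim-match grounding check are assumptions made for illustration, not part of the original report.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"invoice_number": str, "total": float, "due_date": str}


def validate_extraction(raw_json, source_text):
    """Validate model output against the schema.

    Returns (record, problems). Fields that are missing, mistyped, or not
    grounded in the source text are set to None instead of being trusted.
    """
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]

    record, problems = {}, []
    for name, expected in REQUIRED_FIELDS.items():
        value = data.get(name)
        if value is None:
            record[name] = None  # an honest "missing" is acceptable
            continue
        # Accept JSON integers for float fields (but not booleans).
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}")
            record[name] = None
            continue
        # Grounding check (an assumption): string values must appear verbatim
        # in the source, otherwise treat them as possibly hallucinated.
        if expected is str and value not in source_text:
            problems.append(f"{name}: value not found in source text")
            record[name] = None
            continue
        record[name] = value
    return record, problems


if __name__ == "__main__":
    source = "Invoice INV-001. Total due: 120.50"
    raw = '{"invoice_number": "INV-001", "total": 120.5, "due_date": "2024-01-31"}'
    print(validate_extraction(raw, source))
```

A verbatim match is deliberately strict and would reject reformatted values such as normalized dates; a real pipeline would likely loosen it per field.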
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T18:42:44.491679+00:00— report_created — created