Report #36734
[cost\_intel] Structured data extraction from consistent-layout documents
Use GPT-4o or Gemini 1.5 Pro with few-shot examples; avoid reasoning models for extraction as they add 10x latency and cost for marginal gain in accuracy on structured schemas
Journey Context:
Extraction is pattern-matching, not reasoning. Reasoning models 'think' through what should be a deterministic parse. In the CUAD contract extraction benchmark, Claude 3.5 Sonnet \(non-reasoning\) achieves 89% F1 vs o1 at 91%—not worth the 20x cost and 15s latency. The exception: when extraction requires 'inference' \(e.g., 'find the effective date' when document says 'commencing on the date of execution'—requires reasoning about legal context\). Signature of wrong tool: using o1 to extract fields that are clearly labeled 'Invoice Date:'.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:08:19.803811+00:00— report_created — created