Report #84321
[cost_intel] Simple structured data extraction from semi-formatted documents
Use GPT-4o or even GPT-4o-mini for extraction tasks. Reasoning models show no accuracy improvement on schema-following extraction but cost 15-20x more ($0.10 vs $0.005 per 1K docs). Upgrade only if extraction requires multi-hop reasoning across disconnected document sections.
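A minimal sketch of the cost arithmetic and routing rule behind this recommendation. The per-1K-document prices are the ones quoted in this report. The tier names, `estimated_cost`, and `choose_tier` are hypothetical helpers for illustration, not part of any vendor API.

```python
# Per-1K-document prices as quoted in this report; tier names are illustrative.
PRICE_PER_1K_DOCS = {
    "instruct": 0.005,   # GPT-4o / GPT-4o-mini class
    "reasoning": 0.10,   # reasoning-model class
}


def estimated_cost(tier: str, num_docs: int) -> float:
    """Estimated spend in dollars for num_docs documents on the given tier."""
    return PRICE_PER_1K_DOCS[tier] * num_docs / 1000


def choose_tier(needs_multi_hop: bool) -> str:
    """Hypothetical routing rule: pay for a reasoning model only when a field
    depends on combining disconnected sections of the document."""
    return "reasoning" if needs_multi_hop else "instruct"


if __name__ == "__main__":
    docs = 1_000_000
    for tier in PRICE_PER_1K_DOCS:
        print(f"{tier}: ${estimated_cost(tier, docs):,.2f} for {docs:,} docs")
    ratio = PRICE_PER_1K_DOCS["reasoning"] / PRICE_PER_1K_DOCS["instruct"]
    print(f"cost ratio: {ratio:.0f}x")
```

How `needs_multi_hop` gets decided is left open here; in practice it would likely be a per-schema flag set by whoever defines the extraction task.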
Journey Context:
Extraction is pattern matching, not problem solving. Instruct models excel at tasks like 'find all dates in this invoice' or 'extract JSON with these keys'. Reasoning models waste tokens 'thinking about' obvious patterns. Quality degradation signature: identical F1 scores but 10x the latency. Common mistake: assuming 'smarter model = better extraction'. In practice, reasoning models sometimes overthink and hallucinate constraints that are not in the schema.
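One way to check these claims on your own documents is to score each model's output against a small hand-labeled set and to flag fields that are not in the schema. The sketch below assumes flat JSON records with string values. The schema keys, sample values, and helper names are all made up for illustration.

```python
import json

# Hypothetical invoice schema: the keys the extraction prompt asks for.
SCHEMA_KEYS = {"invoice_number", "invoice_date", "total"}


def check_against_schema(raw_output: str, schema_keys: set) -> dict:
    """Parse model output and report keys that are missing or not in the schema."""
    record = json.loads(raw_output)
    keys = set(record)
    return {
        "missing": sorted(schema_keys - keys),
        "unexpected": sorted(keys - schema_keys),  # fields the model invented
    }


def field_f1(predicted: dict, gold: dict) -> float:
    """F1 over exact-match (key, value) pairs; assumes flat, hashable values."""
    pred_items = set(predicted.items())
    gold_items = set(gold.items())
    true_pos = len(pred_items & gold_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_items)
    recall = true_pos / len(gold_items)
    return 2 * precision * recall / (precision + recall)


# Made-up gold label and a made-up model output with one invented field.
GOLD = {"invoice_number": "INV-001", "invoice_date": "2024-03-01", "total": "149.90"}
MODEL_OUTPUT = json.dumps({**GOLD, "payment_terms": "net 30"})

if __name__ == "__main__":
    print(check_against_schema(MODEL_OUTPUT, SCHEMA_KEYS))
    print(round(field_f1(json.loads(MODEL_OUTPUT), GOLD), 3))
```

Running the same labeled set through both model tiers and comparing `field_f1` alongside wall-clock latency gives a direct test of the "identical F1, higher latency" pattern described above.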
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:07:39.393138+00:00— report_created — created