Report #70227
[cost_intel] Why do reasoning models underperform cheap instruct models on structured data extraction?
Avoid o1/o3 for named entity recognition (NER), date extraction, or JSON schema extraction from text under 2,000 tokens; use GPT-4o-mini or Claude 3.5 Haiku instead. Reasoning models tend to 'overthink' these tasks: they hallucinate inferred entities that are not in the text, 'correct' dates to what they believe is accurate, and add explanatory fields that are not in the schema. The cost is 50-100x higher for lower precision and recall.
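One way to guard against these three failure modes, whichever model is used, is a post-filter on the model output. The sketch below is illustrative only: the field names in `ALLOWED_FIELDS`, the helper `filter_extraction`, and the sample text are hypothetical, and verbatim substring matching is a deliberately strict assumption (it is case-sensitive and rejects any normalized value).

```python
import json

# Hypothetical schema: the only keys the downstream parser accepts.
ALLOWED_FIELDS = {"people", "organizations", "dates"}


def filter_extraction(source_text: str, raw_json: str) -> dict:
    """Keep only schema fields, and only values found verbatim in the source."""
    data = json.loads(raw_json)
    cleaned = {}
    for field in sorted(ALLOWED_FIELDS):
        values = data.get(field, [])
        if not isinstance(values, list):
            values = []
        # Drop inferred entities and "corrected" dates: anything that is
        # not a literal substring of the source text is discarded.
        cleaned[field] = [
            v for v in values if isinstance(v, str) and v in source_text
        ]
    # Keys outside ALLOWED_FIELDS (e.g. "reasoning") are never copied over.
    return cleaned


# Illustrative input: the model invented an organization, resolved a
# relative date, and added an explanatory field.
text = "Meeting next Tuesday with Jane Doe."
model_output = json.dumps({
    "people": ["Jane Doe"],
    "organizations": ["Enron"],
    "dates": ["2024-01-16"],
    "reasoning": "Resolved 'next Tuesday' to a calendar date.",
})
result = filter_extraction(text, model_output)
```

Here `result` keeps `"Jane Doe"` and drops the invented organization, the resolved date, and the `reasoning` key. A filter like this hides the symptoms but does not recover the extra cost of the reasoning model.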
Journey Context:
On the Enron email NER dataset, GPT-4o-mini achieves 94% F1 vs o1's 89%, while costing $0.00015 vs $0.015 per 1k docs (100x). The failure mode: given 'Meeting next Tuesday', o1 infers a specific date (2024-01-16) anchored to its training cutoff, violating the instruction to extract only explicit dates. For schema extraction, o1 adds 'reasoning' fields explaining why it extracted a value, which breaks strict JSON parsers. The boundary: when extraction requires cross-document coreference (connecting 'John' in doc A to 'he' in doc B across 50k tokens), o1's reasoning becomes necessary.
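The boundary described above can be written down as a small routing rule. This is a minimal sketch, not a tested policy: `ExtractionJob`, `pick_model_tier`, and the tier labels are hypothetical names, the 2,000-token threshold and per-1k-doc costs are the figures quoted in this report, and the `"evaluate"` fallback for cases the report does not cover is an assumption.

```python
from dataclasses import dataclass

# Figures quoted in this report (USD per 1k docs).
MINI_COST_PER_1K_DOCS = 0.00015
O1_COST_PER_1K_DOCS = 0.015
SHORT_TEXT_TOKENS = 2000

# Tasks the report says to keep away from reasoning models on short text.
SIMPLE_TASKS = {"ner", "date", "json_schema"}


@dataclass
class ExtractionJob:
    task: str                  # hypothetical labels: "ner", "date", "json_schema"
    token_count: int
    needs_cross_doc_coreference: bool = False


def pick_model_tier(job: ExtractionJob) -> str:
    """Return a tier label; mapping labels to model names is left to the caller."""
    if job.needs_cross_doc_coreference:
        # e.g. linking 'John' in doc A to 'he' in doc B across ~50k tokens
        return "reasoning"
    if job.task in SIMPLE_TASKS and job.token_count < SHORT_TEXT_TOKENS:
        return "instruct"
    # Assumption: anything else is not covered by the report, so benchmark it.
    return "evaluate"


cost_ratio = O1_COST_PER_1K_DOCS / MINI_COST_PER_1K_DOCS  # roughly 100x

short_ner = pick_model_tier(ExtractionJob("ner", 800))
cross_doc = pick_model_tier(ExtractionJob("ner", 50_000, True))
long_schema = pick_model_tier(ExtractionJob("json_schema", 12_000))
```

With these inputs, the short NER job routes to the instruct tier, the cross-document job routes to the reasoning tier, and the long single-document schema job falls into the unverified middle ground that should be measured rather than assumed.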
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:27:13.650919+00:00— report_created — created