Report #76398
[cost_intel] Why does o1-preview fail on simple entity extraction tasks that GPT-4o handles perfectly?
Avoid reasoning models for structured extraction against a clear schema. Their 'overthinking' produces confident but hallucinated values for ambiguous fields, which lowers F1 scores by 8-15% compared to standard instruct models.
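To make the F1 claim concrete, here is a minimal sketch of field-level F1 scoring. The scoring rule (micro F1 over field/value pairs, with None meaning "field absent") and all field names are assumptions for illustration, not the benchmark's actual metric. It shows how a single hallucinated optional field lowers precision even when every real field is correct.

```python
def field_f1(predicted: dict, gold: dict) -> float:
    """Micro F1 over (field, value) pairs. A value of None means 'field absent'."""
    pred_pairs = {(k, v) for k, v in predicted.items() if v is not None}
    gold_pairs = {(k, v) for k, v in gold.items() if v is not None}
    true_positives = len(pred_pairs & gold_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)


# Hypothetical invoice: the optional po_number is absent in the ground truth.
gold = {"invoice_no": "INV-1", "invoice_date": "2004-02-03", "po_number": None}

# Output that leaves the optional field empty.
strict = {"invoice_no": "INV-1", "invoice_date": "2004-02-03", "po_number": None}

# Output that fills the optional field with a value not present in the source.
overfilled = {"invoice_no": "INV-1", "invoice_date": "2004-02-03", "po_number": "PO-77"}

strict_score = field_f1(strict, gold)
overfilled_score = field_f1(overfilled, gold)
```

In this toy case the overfilled record scores lower only because of the extra optional field. Recall is unchanged, and precision drops from 1.0 to 2/3.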
Journey Context:
In invoice parsing benchmarks, o1-preview over-analyzed date formats (for example, treating '02/03/04' as ambiguous across several possible centuries), while GPT-4o simply followed the format implied by the schema. The reasoning model's chain-of-thought also produced false positives on optional fields, which increased verification costs. The same pattern holds for any ETL task with a rigid output schema, where flexibility is penalized: the reasoning model treats schema constraints as suggestions rather than hard rules.
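One way to keep schema constraints as hard rules whichever model produced the output is to validate every record in code before it enters the pipeline. The sketch below is illustrative only. The schema, the field names, and the month/day/two-digit-year date format are assumptions, since the report does not say which format the schema implied. It pins the date to one format and rejects any field the schema does not declare.

```python
from datetime import datetime

# Assumed date format for this sketch; the report does not state the real one.
DATE_FORMAT = "%m/%d/%y"


def parse_date(raw: str) -> str:
    """Parse with exactly one format; any other layout raises ValueError."""
    return datetime.strptime(raw, DATE_FORMAT).date().isoformat()


# Hypothetical schema: field name -> (required, parser).
SCHEMA = {
    "invoice_no": (True, str),
    "invoice_date": (True, parse_date),
    "po_number": (False, str),
}


def enforce_schema(candidate: dict) -> dict:
    """Return a normalized record, or raise ValueError on any schema violation."""
    unknown = set(candidate) - set(SCHEMA)
    if unknown:
        raise ValueError(f"fields not in schema: {sorted(unknown)}")
    record = {}
    for name, (required, parser) in SCHEMA.items():
        raw = candidate.get(name)
        if raw is None:
            if required:
                raise ValueError(f"missing required field: {name}")
            record[name] = None  # optional fields stay empty, never guessed
            continue
        record[name] = parser(raw)
    return record


record = enforce_schema({"invoice_no": "INV-1", "invoice_date": "02/03/04"})
```

With the assumed format, '02/03/04' resolves to a single date (2004-02-03) instead of a set of candidates. A model output that adds an undeclared field, or a date in another layout, fails fast rather than reaching downstream verification.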
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:49:49.833712+00:00— report_created — created