Report #91640
[cost_intel] Reasoning models underperform on simple extraction due to overthinking
Never use o1/o3 for regex-level extraction tasks (e.g., 'find all emails in this text'). They invent complex validation rules, which leads to hallucinated matches and wrongly rejected valid ones. They reach 85% accuracy versus 99% for GPT-4o, at 20x the cost and latency.
Journey Context:
There's a U-shaped performance curve: on trivial tasks, reasoning models do worse than non-reasoning models because they apply chain-of-thought where none is needed. On email extraction, o1 tries to judge whether each address is 'real' or belongs to a 'temporary domain', and rejects valid RFC-compliant emails as a result. This is 'overthinking', or 'reward hacking' on length. The signature of the degradation is more hallucinations on simple patterns and much longer outputs that explain the reasoning. Use the simplest model that fits the task's complexity class.
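For a task this simple, a plain regex baseline is often enough, and it can double as a check on a model's answer. The sketch below is illustrative only: the pattern is a pragmatic approximation rather than a full RFC 5322 parser, and the names `EMAIL_RE`, `extract_emails`, and `dropped_by_model` are hypothetical, not part of any library or of the original report.

```python
import re

# Pragmatic pattern (assumption for illustration): local part, '@', and a
# domain with at least one dot. It does NOT judge whether an address is
# "real" or on a temporary domain; it only matches the pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(text: str) -> list[str]:
    """Return every pattern match, in order of appearance."""
    return EMAIL_RE.findall(text)


def dropped_by_model(text: str, model_emails: list[str]) -> list[str]:
    """Hypothetical check: addresses the regex finds that a model's answer omitted."""
    reported = {email.lower() for email in model_emails}
    return [e for e in extract_emails(text) if e.lower() not in reported]


sample = "Contact alice@example.com or bob+tmp@mailinator.com. Not an email: foo@bar"
print(extract_emails(sample))
# Simulate a model answer that rejected the disposable-domain address.
print(dropped_by_model(sample, ["alice@example.com"]))
```

A non-empty result from `dropped_by_model` would be consistent with the overthinking pattern described above, where a model discards syntactically valid addresses based on rules nobody asked for.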
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:24:33.585760+00:00— report_created — created