Report #48278
[cost\_intel] Using o3/o1 for tasks requiring broad knowledge retrieval or simple pattern matching where they lose to cheap instruct models due to 'overthinking'
Avoid reasoning models for large-scale entity extraction \(CoNLL-2003 NER\), regex-like pattern matching, simple classification \(sentiment\), and broad trivia QA \(SimpleQA easy subset\). Use 4o or a smaller model with RAG instead; reasoning models underperform on these surface-level pattern tasks.
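One way to apply this recommendation is a small routing table that sends the listed task types to a cheap tier. The sketch below is illustrative only: the task labels, the `Tier` names, and the choice to default unknown tasks to the reasoning tier are assumptions, not part of the report.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap-instruct"   # e.g. 4o or a smaller model, optionally with RAG
    REASONING = "reasoning"    # e.g. o1 / o3


# Hypothetical labels for the surface-level tasks named in the report.
SURFACE_TASKS = {"ner", "pattern_matching", "sentiment", "trivia_qa"}


def pick_tier(task_type: str) -> Tier:
    """Route surface-level tasks to the cheap tier.

    Assumption: anything not in SURFACE_TASKS goes to the reasoning tier;
    a real router might default the other way.
    """
    if task_type.strip().lower() in SURFACE_TASKS:
        return Tier.CHEAP
    return Tier.REASONING
```

For example, `pick_tier("sentiment")` returns `Tier.CHEAP`, while an unlisted label such as `"multi_step_proof"` falls through to `Tier.REASONING`.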
Journey Context:
Reasoning models are optimized to 'think longer', which hurts tasks that need instant pattern matching. On SimpleQA \(OpenAI's benchmark\), o3-preview scores lower than 4o on 'easy' factual questions because it over-analyzes simple facts and introduces hallucinations along the way \('Let me think about whether Paris is in France... \[elaborate reasoning\]... yes'\). On CoNLL-2003 NER, 4o-mini beats o3 on F1 while being 100x cheaper. The failure mode is spurious chains of thought generated for obvious facts. Signature: if the task can be solved by embedding similarity search or has a deterministic regex solution, a reasoning model is a waste. Cost ratio: 50-200x more expensive, for a negative quality delta on these tasks.
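The 'signature' check above can be sketched as a regex-first extractor: if a deterministic pattern exists for the task, use it and never call a model. The registry contents, the function names, and the `llm_fallback` callable are all hypothetical; the cost helper only restates the report's 50-200x ratio as arithmetic.

```python
import re
from typing import Callable, Optional

# Hypothetical registry of tasks that have a deterministic regex solution.
REGEX_SOLVERS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract(
    task: str,
    text: str,
    llm_fallback: Optional[Callable[[str, str], list]] = None,
) -> list:
    """Use a regex when one is registered; otherwise defer to a model call."""
    pattern = REGEX_SOLVERS.get(task)
    if pattern is not None:
        return pattern.findall(text)  # deterministic, no model call at all
    if llm_fallback is None:
        raise ValueError(f"no regex for task {task!r} and no fallback given")
    return llm_fallback(task, text)


def reasoning_cost(n_requests: int, cheap_cost_per_request: float,
                   cost_ratio: float) -> float:
    """Estimated spend if the same requests went to a reasoning model.

    cost_ratio is the multiplier from the report (50 to 200).
    """
    return n_requests * cheap_cost_per_request * cost_ratio
```

As a usage note, `extract("iso_date", "Released 2026-06-19, patched 2026-07-01.")` returns both dates without any model call. With an assumed cheap-tier price of 0.001 per request, 1,000 requests cost about 1 unit on the cheap tier versus roughly 50 to 200 units at the reported ratio.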
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:31:00.215528+00:00— report_created — created