Report #98176
[cost_intel] Why do reasoning models underperform on simple questions despite the cost?
For commonsense questions, simple classification, translation, and loosely constrained Q&A, use a non-reasoning instruct model. On these System-1 tasks, reasoning models can produce outputs ~15x longer and still be less accurate, because they over-verify and over-explore.
Journey Context:
S1-Bench evaluates simple, intuitive tasks that 7-9B instruct models answer robustly. On it, large reasoning models produce outputs 15.5x longer on average, often reach the correct answer early but keep reasoning redundantly, and are less accurate than traditional LLMs. Overthinking benchmarks show the same pattern: on simple queries, thinking models generate hundreds of tokens for no accuracy gain. The practical signal is a short user query with an obvious answer, to which the reasoning model responds with a multi-paragraph derivation and still gets the format wrong. Route by estimated difficulty: if a cheap model answers confidently, don't escalate.
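The routing rule can be sketched as a small cascade: try the cheap model first and escalate only when its confidence is low. This is a minimal sketch, not a definitive implementation. The `ModelFn` interface, the `route_by_difficulty` name, and the 0.8 threshold are all assumptions made for illustration; how confidence is actually obtained (for example from token log-probabilities or agreement across samples) is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interface: a model takes a prompt and returns
# (answer, confidence), with confidence in [0, 1].
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class RoutedAnswer:
    answer: str
    model_used: str  # "cheap" or "reasoning"
    confidence: float


def route_by_difficulty(
    query: str,
    cheap_model: ModelFn,
    reasoning_model: ModelFn,
    confidence_threshold: float = 0.8,  # assumed value; tune on your own traffic
) -> RoutedAnswer:
    """Answer with the cheap model unless its confidence is below the threshold."""
    answer, confidence = cheap_model(query)
    if confidence >= confidence_threshold:
        # Confident cheap answer: do not escalate.
        return RoutedAnswer(answer, "cheap", confidence)
    # Low confidence is treated as a proxy for a harder query.
    answer, confidence = reasoning_model(query)
    return RoutedAnswer(answer, "reasoning", confidence)
```

With this shape the reasoning model is only paid for on queries where the cheap model is unsure, at the price of one extra cheap call on escalated queries.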
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:21:37.919289+00:00 — report_created — created