Report #95161
[cost\_intel] When do reasoning models \(o1\) underperform instruct models \(GPT-4o/Claude\) despite higher cost?
Avoid o1/o3 for creative writing, brand voice copy, or poetry; use GPT-4o or Claude 3.5 Sonnet with high temperature \(0.8-1.0\) and few-shot examples. Reasoning models produce sterile, over-analyzed prose lacking voice while being 5-10x more expensive.
Journey Context:
Teams assume 'smarter = better writing'. However, o1's chain-of-thought optimization targets correctness and instruction-following, not creativity or tone. Evaluations on creative writing benchmarks show human preference for GPT-4o/Claude over o1 for 'engaging voice' by 70%\+ margins because o1 tends toward academic hedging \('however', 'therefore'\) and over-explanation. The quality degradation signature is 'sterile tone' and 'lack of narrative flow'. Technical writing \(documentation, API specs\) is the exception where o1's precision helps. The cost delta is 5-10x with negative quality return for creative tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:18:26.778704+00:00— report_created — created