Report #92048
[cost_intel] Using reasoning models for creative writing and conversational UI, where they over-optimize and produce wooden prose
Reserve o1/o3 for math, formal logic, coding with hidden test cases, and multi-step planning; use GPT-4o/instruct models for creative writing, style transfer, and conversational UX
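The recommendation above amounts to a routing rule keyed on task type. A minimal sketch follows, assuming the caller already knows the task category; the `TaskType` enum, the `pick_model_tier` function, and the tier labels `"reasoning"` and `"general"` are hypothetical names introduced for illustration, not part of any vendor API.

```python
from enum import Enum


class TaskType(Enum):
    # Categories taken from the recommendation in this report.
    MATH = "math"
    FORMAL_LOGIC = "formal_logic"
    CODE_HIDDEN_TESTS = "code_hidden_tests"
    MULTI_STEP_PLANNING = "multi_step_planning"
    CREATIVE_WRITING = "creative_writing"
    STYLE_TRANSFER = "style_transfer"
    CONVERSATIONAL = "conversational"


# Task types the report reserves for reasoning models (o1/o3).
REASONING_TASKS = frozenset({
    TaskType.MATH,
    TaskType.FORMAL_LOGIC,
    TaskType.CODE_HIDDEN_TESTS,
    TaskType.MULTI_STEP_PLANNING,
})


def pick_model_tier(task: TaskType) -> str:
    """Return a hypothetical tier label for the given task type.

    "reasoning" stands for an o1/o3-class model, "general" for a
    GPT-4o/instruct-class model. Mapping a label to a concrete model
    identifier is left to the caller.
    """
    return "reasoning" if task in REASONING_TASKS else "general"
```

For example, `pick_model_tier(TaskType.CREATIVE_WRITING)` returns `"general"`, while `pick_model_tier(TaskType.MATH)` returns `"reasoning"`.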
Journey Context:
Reasoning models show 40-60% accuracy gains on AIME math and 2-3x pass@1 on Codeforces versus GPT-4o, but score lower on human preference evaluations for creative writing (Arena Elo). On creative tasks they tend to over-optimize, producing verbose, formulaic text, so the 10-30x higher cost relative to GPT-4o buys no quality gain there. Conversely, on code generation with hidden test cases, the cost per correct answer is lower for o1 despite its higher per-token cost, because it needs fewer retries.
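The cost-per-correct-answer argument can be made concrete with a small calculation. This is a sketch under a simplifying assumption that is not stated in the report: attempts are independent and retried until one passes, so the expected number of attempts is 1 / pass rate. The function names and all numbers in the usage note are hypothetical, not measurements.

```python
def cost_per_correct(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to obtain one correct answer.

    Assumes independent attempts retried until success, so the expected
    number of attempts is 1 / pass_rate (mean of a geometric distribution).
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if cost_per_attempt < 0.0:
        raise ValueError("cost_per_attempt must be non-negative")
    return cost_per_attempt / pass_rate


def reasoning_is_cheaper(
    general_cost: float,
    general_pass_rate: float,
    reasoning_cost: float,
    reasoning_pass_rate: float,
) -> bool:
    """True when the reasoning model has the lower expected cost per correct answer."""
    return cost_per_correct(reasoning_cost, reasoning_pass_rate) < cost_per_correct(
        general_cost, general_pass_rate
    )
```

With hypothetical inputs of 1.0 cost units per attempt at a 5% pass rate for the general model and 10.0 cost units per attempt at a 60% pass rate for the reasoning model, `reasoning_is_cheaper(1.0, 0.05, 10.0, 0.6)` evaluates to true. Under this simple model the reasoning model only wins when its pass-rate advantage exceeds its per-attempt cost multiple, so the outcome depends on pass rates and prices measured for the actual workload.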
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:05:41.100741+00:00— report_created — created