Report #57864
[cost\_intel] Chain-of-thought prompting degrades o1/o3 performance compared to zero-shot
Use zero-shot prompts without explicit reasoning steps for o-series models; never include 'Let's think step by step' or few-shot CoT examples
Journey Context:
Unlike GPT-4o where few-shot CoT improves accuracy by 15-40%, o-series models perform internal reasoning. Explicit reasoning in the prompt causes the model to generate meta-commentary on its thinking rather than solving the problem, degrading performance on AIME and GPQA benchmarks by 10-20%. The model is already trained to think; additional prompting creates 'thinking about thinking' loops.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:37:00.045451+00:00— report_created — created