Report #49471
[cost_intel] Why does few-shot prompting degrade performance on o1/o3?
For o1/o3, use zero-shot prompts with explicit step-by-step instructions. Avoid multi-example few-shot prompts (3-5 examples), which can interfere with the models' internal chain-of-thought reasoning. If examples are strictly necessary for format adherence, use at most one, clearly separated from the actual problem, or switch to GPT-4o, which responds reliably to 3-5-shot prompting.
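The recommendation above can be sketched as a small prompt builder. This is an illustrative sketch only, not an official API: the names `build_prompt` and `max_shots`, the `###` delimiters, and the prefix check on model names are all assumptions made for this example.

```python
from typing import Optional, Tuple

# Assumption: reasoning models are recognised by a name prefix.
REASONING_PREFIXES = ("o1", "o3")


def max_shots(model: str) -> int:
    """Cap on examples per the report: 1 for o1/o3, up to 5 otherwise."""
    return 1 if model.startswith(REASONING_PREFIXES) else 5


def build_prompt(problem: str, instructions: str,
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Build a zero-shot prompt, or a 1-shot prompt if `example` is given.

    The single example is fenced off with explicit delimiters so it is
    clearly separated from the problem to be solved.
    """
    parts = [instructions.strip()]
    if example is not None:
        ex_in, ex_out = example
        parts.append(
            "### EXAMPLE (format reference only)\n"
            f"Input: {ex_in}\nOutput: {ex_out}\n"
            "### END EXAMPLE"
        )
    parts.append(f"### PROBLEM\n{problem.strip()}")
    return "\n\n".join(parts)


# Zero-shot: the default for o1/o3.
zero_shot = build_prompt("What is 17 * 23?", "Solve the problem step by step.")

# 1-shot: only when the output format must be pinned down.
one_shot = build_prompt(
    "What is 17 * 23?",
    "Solve the problem and reply as JSON.",
    example=("What is 2 + 2?", '{"answer": 4}'),
)
```

Keeping the example in its own delimited block, ahead of the problem, is one way to get the "clear separation" the report asks for; whether these particular delimiters work well for a given model would need checking.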
Journey Context:
Developers carry over patterns from GPT-4, where 3-5 examples boost accuracy by 15-20%. Reasoning models, however, are trained with RL on chain-of-thought; few-shot examples interfere with their internal deliberation, often causing them to hallucinate steps that mirror the example's structure instead of solving the problem at hand. On GSM8K, 5-shot o1 underperforms 0-shot o1 by 8%. For formatting tasks (JSON output), few-shot examples still help, but reasoning models often ignore formatting instructions in favor of reasoning.
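Since the figures above will vary by model version and task, it may be worth measuring the zero-shot versus few-shot gap on your own data. The following is a minimal sketch of such a comparison; `ask` is a hypothetical callable wrapping whatever model client you use, and the answer check is deliberately crude.

```python
from typing import Callable, Iterable, Tuple


def accuracy(ask: Callable[[str], str],
             make_prompt: Callable[[str], str],
             dataset: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (question, answer) items the model gets right.

    `ask` is a hypothetical wrapper: it takes a prompt string and returns
    the model's reply. A reply counts as correct if, after stripping
    whitespace, it ends with the expected answer (a crude check; swap in
    a stricter parser for real evaluations).
    """
    items = list(dataset)
    if not items:
        raise ValueError("dataset is empty")
    hits = sum(ask(make_prompt(q)).strip().endswith(a) for q, a in items)
    return hits / len(items)


def compare(ask: Callable[[str], str],
            zero_shot: Callable[[str], str],
            few_shot: Callable[[str], str],
            dataset: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (zero-shot accuracy, few-shot accuracy) on the same items."""
    items = list(dataset)
    return accuracy(ask, zero_shot, items), accuracy(ask, few_shot, items)
```

Running `compare` with the same `ask` and dataset but two different prompt builders gives a like-for-like comparison; no particular result is implied here.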
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:31:18.215352+00:00— report_created — created