Report #99566
[cost_intel] Few-shot examples hurt reasoning model performance on medical and reasoning benchmarks
Use zero-shot prompts with clear goals, constraints, and an explicit output contract for o1/o3/o4/GPT-5 reasoning models. Reserve few-shot prompting for instruct models like GPT-4o where examples provide format guidance. With reasoning models, supply rubrics, desired output format, and verification criteria instead of solved examples.
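A minimal sketch of what such a zero-shot prompt could look like when assembled in code. The `ZeroShotPrompt` class, its field names, and the section headings are hypothetical and chosen only for illustration; they are not part of any vendor SDK. The rendered string would be passed as the user message to whichever reasoning model you call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ZeroShotPrompt:
    """Hypothetical container: goal, constraints, output contract, checks.

    Deliberately has no slot for solved examples.
    """

    goal: str
    constraints: list[str] = field(default_factory=list)
    output_contract: str = ""
    verification: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the non-empty sections into one plain-text prompt."""
        sections = [f"Goal:\n{self.goal}"]
        if self.constraints:
            bullets = "\n".join(f"- {c}" for c in self.constraints)
            sections.append(f"Constraints:\n{bullets}")
        if self.output_contract:
            sections.append(f"Output format:\n{self.output_contract}")
        if self.verification:
            checks = "\n".join(f"- {v}" for v in self.verification)
            sections.append(f"Before answering, verify:\n{checks}")
        return "\n\n".join(sections)


# Illustrative content only; the wording of each field is an assumption.
prompt_text = ZeroShotPrompt(
    goal="Choose the single best answer to the multiple-choice question below.",
    constraints=[
        "Use only information stated in the question.",
        "Pick exactly one option.",
    ],
    output_contract='Return JSON: {"answer": "<option letter>", "rationale": "<one sentence>"}',
    verification=[
        "The chosen option is consistent with every finding in the question.",
        "The output is valid JSON with exactly the two keys above.",
    ],
).render()

print(prompt_text)
```

The rubric and verification criteria take the place that solved demonstrations would occupy in a few-shot prompt for an instruct model.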
Journey Context:
Reasoning models are trained with RL to generate an internal chain of thought. In-context examples can anchor them to surface patterns in the demonstrations instead of triggering their learned reasoning process. The Medprompt-to-o1 paper found that five-shot prompting significantly decreased o1-preview performance on MedQA, and OpenAI's reasoning guide explicitly recommends clear goals and constraints over few-shot examples. This is the reverse of the pattern with instruct models, where few-shot examples often provide the biggest gain. The costly mistake is paying a 10-40x premium for a reasoning model and then prompting it like GPT-4o, with lengthy examples that add input tokens and degrade accuracy.
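The token side of that mistake can be estimated with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the example count, tokens per example, price, and request volume are placeholder assumptions, not measured values or published rates.

```python
def few_shot_overhead_usd(
    n_examples: int,
    tokens_per_example: int,
    usd_per_million_input_tokens: float,
    n_requests: int,
) -> float:
    """Extra input spend from prepending solved examples to every request."""
    extra_tokens = n_examples * tokens_per_example * n_requests
    return extra_tokens * usd_per_million_input_tokens / 1_000_000


# Placeholder inputs (assumptions): 5 examples of 400 tokens each,
# 15 USD per million input tokens, 10,000 requests.
overhead = few_shot_overhead_usd(
    n_examples=5,
    tokens_per_example=400,
    usd_per_million_input_tokens=15.0,
    n_requests=10_000,
)
print(f"Added input cost: ${overhead:.2f}")
```

This covers only the added input tokens; any accuracy loss from the examples is a separate cost that has to be measured on your own evaluation set.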
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:21:25.427698+00:00— report_created — created