Report #37838
[counterintuitive] Why does adding few-shot examples to the prompt sometimes make the model less accurate than zero-shot?
Benchmark zero-shot against few-shot on your specific task before committing to few-shot. Prefer zero-shot for tasks where modern instruction-tuned models already perform well. If you do use few-shot, make sure the examples are diverse (not all sharing a superficial pattern the model might overfit to) and that together they clearly demonstrate the full label space. Keep in mind that few-shot examples consume context window that could otherwise hold the task input itself.
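The comparison recommended here can be sketched as a small Python harness. This is a minimal sketch under stated assumptions, not a definitive benchmark: `predict` is a hypothetical callable standing in for whatever model client you use, and the `Input:`/`Label:` prompt layout, the function names, and the exact-match scoring are all illustrative choices that do not come from the report.

```python
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def zero_shot_prompt(instruction: str, query: str) -> str:
    """Instruction plus the query, with no demonstrations."""
    return f"{instruction}\n\nInput: {query}\nLabel:"


def few_shot_prompt(instruction: str, shots: Sequence[Example], query: str) -> str:
    """Instruction, then labelled demonstrations, then the query."""
    demos = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in shots)
    return f"{instruction}\n\n{demos}\n\nInput: {query}\nLabel:"


def accuracy(
    predict: Callable[[str], str],       # hypothetical: prompt in, raw completion out
    build_prompt: Callable[[str], str],  # maps a query to a full prompt
    eval_set: Sequence[Example],
) -> float:
    """Exact-match accuracy (case-insensitive, whitespace-stripped)."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = sum(
        predict(build_prompt(x)).strip().lower() == y.strip().lower()
        for x, y in eval_set
    )
    return correct / len(eval_set)


def compare(
    predict: Callable[[str], str],
    instruction: str,
    shots: Sequence[Example],
    eval_set: Sequence[Example],
) -> Dict[str, float]:
    """Score the same eval set with and without demonstrations."""
    return {
        "zero_shot": accuracy(
            predict, lambda q: zero_shot_prompt(instruction, q), eval_set
        ),
        "few_shot": accuracy(
            predict, lambda q: few_shot_prompt(instruction, shots, q), eval_set
        ),
    }
```

To use it, wrap your own model call in a function that takes a prompt string and returns the completion, then pass it as `predict`. Keep the evaluation examples separate from the demonstrations, and use enough of them that a difference between the two scores is unlikely to be noise.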
Journey Context:
The standard mental model is: more examples = better performance. This held for GPT-2/GPT-3 base models, where few-shot prompting was the primary way to steer behavior. But instruction-tuned models already 'know what to do' from their fine-tuning, and few-shot examples can interfere with that. Research has reported that few-shot performance is driven primarily by the input-output format (demonstrating the expected output shape) rather than by the actual content of the examples: replacing real labels with random labels often barely hurts performance. If the model is reading the examples mostly for their form, then any incidental regularity in that form becomes a signal. Examples can therefore anchor the model to surface patterns (e.g., always outputting a certain length, or copying a stylistic quirk) rather than helping with the underlying reasoning. Few-shot can also introduce distribution shift: if your examples are noticeably easier or harder than the actual query, the model can miscalibrate. For modern instruction-following models, a clear zero-shot instruction often outperforms a cluttered few-shot prompt.
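Two of the points above can be checked on your own demonstration set: whether it carries an obvious surface pattern, and whether the labels in it matter at all. The Python sketch below is illustrative only. The function names, the choice of statistics (label coverage, majority-label share, spread of input lengths in words), and the seeded random-label ablation are assumptions made for this example, not a method taken from the report.

```python
import random
from collections import Counter
from statistics import pstdev
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label)


def audit_shots(shots: Sequence[Example], label_space: Sequence[str]) -> Dict[str, object]:
    """Flag superficial regularities the model might latch onto."""
    if not shots:
        raise ValueError("shots must not be empty")
    counts = Counter(label for _, label in shots)
    lengths = [len(text.split()) for text, _ in shots]
    return {
        # Labels from the label space that no demonstration shows.
        "missing_labels": sorted(set(label_space) - set(counts)),
        # 1.0 means every demonstration carries the same label.
        "majority_share": max(counts.values()) / len(shots),
        # 0.0 means every input has the same word count.
        "input_length_stdev": pstdev(lengths),
    }


def randomize_labels(
    shots: Sequence[Example], label_space: Sequence[str], seed: int = 0
) -> List[Example]:
    """Keep inputs and format, but draw each label at random (ablation)."""
    rng = random.Random(seed)  # seeded so the ablation is repeatable
    choices = list(label_space)
    return [(text, rng.choice(choices)) for text, _ in shots]
```

If accuracy with the output of `randomize_labels` is close to accuracy with the real labels, the demonstrations are probably contributing format rather than task knowledge, and a well-worded zero-shot instruction that specifies the output format may do the same job in fewer tokens.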
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:59:35.636836+00:00— report_created — created