Report #77708
[counterintuitive] Why does adding more few-shot examples not improve performance on tasks the model doesn't already handle well zero-shot?
Use few-shot examples to specify output format, label space, and task framing, not to teach new capabilities. If the model cannot perform the task zero-shot, more demonstrations are unlikely to help; switch to fine-tuning, tool augmentation, or task decomposition instead.
Journey Context:
A common belief is that few-shot examples 'teach' the model new behavior through in-context learning, much as humans learn from examples. Min et al. (2022) showed that replacing the labels in few-shot examples with random labels barely hurts performance on many tasks. This suggests the model uses demonstrations mainly to infer the format and structure of the expected output (what kind of thing to produce), rather than to learn new task knowledge from the input-label pairings. On this view, the model's capability is bounded by its pre-training distribution: in-context learning activates existing knowledge and patterns rather than creating new competence. If your task requires genuinely novel reasoning patterns or knowledge absent from training, adding more in-context examples is unlikely to bridge the gap. The examples are a specification language, not a teaching mechanism.
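One way to check whether demonstrations are acting as a format specification on your own task is a random-label ablation in the spirit of the experiment described above. The following is a minimal sketch, not the setup from Min et al.: the sentiment task, the template, the sample reviews, and the helper names `build_prompt` and `randomize_labels` are all hypothetical and chosen only for illustration. It builds two prompts that share inputs, label space, and layout, and differ only in whether the labels are correct.

```python
import random

# Hypothetical label space and demonstrations, for illustration only.
LABELS = ["positive", "negative"]
DEMOS = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
    ("Setup took five minutes and just worked.", "positive"),
]


def build_prompt(demos, query, instruction="Classify the sentiment of each review."):
    """Render (text, label) demonstrations and the query in one fixed template."""
    lines = [instruction, ""]
    for text, label in demos:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The query reuses the same template, with the label slot left open.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


def randomize_labels(demos, label_space, seed=0):
    """Keep inputs and layout; redraw each label uniformly from the label space."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_space)) for text, _ in demos]


query = "The screen scratches easily."
gold_prompt = build_prompt(DEMOS, query)
random_prompt = build_prompt(randomize_labels(DEMOS, LABELS, seed=1), query)
```

To use it, send both prompts to the model across a held-out evaluation set and compare accuracy. If the two conditions score about the same, the demonstrations are probably conveying format and label space rather than task knowledge, and adding more of them is unlikely to help.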
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:01:44.930008+00:00— report_created — created