Report #90400
[counterintuitive] Why does adding more few-shot examples sometimes decrease model performance?
Use 2-5 high-quality, diverse examples rather than maximizing example count. Test performance as examples are added. Prefer a clear task description over many examples when the task is well-defined.
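One way to act on "test performance as examples are added" is a small sweep over the number of demonstrations. The sketch below is illustrative only: `build_prompt`, `sweep_shot_count`, the `Input:`/`Output:` layout, and the `complete` callable (a stand-in for whatever model call you use) are all hypothetical names, and exact-match scoring is an assumption that only suits short, label-style outputs.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected output)


def build_prompt(task: str, shots: Sequence[Example], query: str) -> str:
    """Task description first, then the demonstrations, then the query."""
    parts = [task]
    for text, label in shots:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def sweep_shot_count(
    task: str,
    pool: Sequence[Example],
    dev_set: Sequence[Example],
    complete: Callable[[str], str],  # hypothetical: prompt in, model text out
    max_shots: int = 5,
) -> dict[int, float]:
    """Return exact-match accuracy on dev_set for 0..max_shots demonstrations."""
    if not dev_set:
        raise ValueError("dev_set must not be empty")
    results: dict[int, float] = {}
    for k in range(min(max_shots, len(pool)) + 1):
        shots = pool[:k]  # the first k demonstrations from the pool
        correct = sum(
            complete(build_prompt(task, shots, text)).strip() == label
            for text, label in dev_set
        )
        results[k] = correct / len(dev_set)
    return results
```

If accuracy plateaus or drops as `k` grows, that is a signal to stop adding examples. Keeping `dev_set` disjoint from `pool` avoids scoring the model on its own demonstrations.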
Journey Context:
The intuition 'more examples = better pattern recognition' is deeply ingrained. But Min et al. (2022) showed that the correctness of the labels in few-shot examples matters far less than expected: on the classification and multiple-choice tasks they studied, replacing correct labels with random labels only slightly degraded performance. What mattered more was the format of the demonstrations, the label space, and the distribution of the input text. This suggests the model is primarily picking up the task format from examples rather than the task logic. Adding many examples increases the risk that the model over-indexes on surface patterns in those specific examples, latches onto spurious correlations, and has less context window left for the actual task. A few well-chosen examples that demonstrate format diversity often outperform many examples that repeat the same pattern.
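To check how much your own task depends on label correctness, you can run a random-label ablation in the spirit of the experiment described above. This is a minimal sketch, not the paper's code: `randomize_labels` is a hypothetical helper, and uniform sampling from the label space with a fixed seed is an assumption made here for reproducibility.

```python
import random
from typing import Sequence

Example = tuple[str, str]  # (input text, label)


def randomize_labels(
    shots: Sequence[Example],
    label_space: Sequence[str],
    seed: int = 0,
) -> list[Example]:
    """Keep inputs and format; replace each label with a uniform random draw."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [(text, rng.choice(label_space)) for text, _ in shots]
```

Comparing accuracy with the original demonstrations against accuracy with the randomized ones gives a rough read on whether the examples are teaching format or content for your task. A small gap would be consistent with the format-driven picture, while a large gap suggests the labels do carry weight for your task.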
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:19:47.692566+00:00— report_created — created