Report #49823
[counterintuitive] Why does adding more few-shot examples to the prompt sometimes make the model worse?
Optimize for example quality and diversity over quantity. Three to five well-chosen examples often outperform 20 or more. Measure performance on a held-out set as you add examples: there is typically a sweet spot after which more examples hurt. Remove redundant or contradictory examples.
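One way to find that sweet spot is to score the same prompt at several shot counts and keep the smallest count that reaches the best score. The sketch below is illustrative only: `sweep_shot_counts`, `best_shot_count`, and the `evaluate` callback are hypothetical names, and `evaluate` is assumed to wrap whatever model call and held-out set you already use.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected output)


def sweep_shot_counts(
    examples: Sequence[Example],
    evaluate: Callable[[Sequence[Example]], float],
    counts: Sequence[int] = (0, 1, 2, 3, 5, 8, 12, 20),
) -> dict[int, float]:
    """Score the prompt at several shot counts.

    `evaluate` is an assumed, user-supplied function: it receives the
    examples to place in the prompt and returns a score (higher is better)
    measured on a held-out set. `counts` is expected in ascending order.
    """
    results: dict[int, float] = {}
    for k in counts:
        if k > len(examples):
            break  # not enough examples for this or any larger count
        results[k] = evaluate(examples[:k])
    return results


def best_shot_count(results: dict[int, float]) -> int:
    """Return the smallest k that reaches the best score (ties favor shorter prompts)."""
    best = max(results.values())
    return min(k for k, score in results.items() if score == best)
```

Because this always takes the first k examples, the result also depends on example order; shuffling and repeating the sweep gives a rough sense of how stable the peak is.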
Journey Context:
The intuition is 'more examples = better performance.' In practice, few-shot performance is often non-monotonic: it can peak at a small number of examples (commonly 3-5) and then degrade. Reasons: (1) more examples consume the context window, leaving less room for the actual task and pushing the query further from the start of the prompt, (2) the model can overfit to surface patterns in the examples, such as formatting or label order, rather than the underlying task, (3) a larger example set is more likely to contain subtle inconsistencies that confuse the model, (4) longer prompts dilute attention to the actual query. The original GPT-3 paper itself reported tasks where few-shot performance was erratic and did not improve monotonically with more examples. This appears to be a general property of in-context learning via attention: more context is not always better context.
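Reason (3) and the advice to remove redundant or contradictory examples can be partly checked mechanically. The sketch below is a minimal illustration under stated assumptions: `normalize` and `audit_examples` are hypothetical helpers, examples are `(input, output)` string pairs, and "duplicate" means identical after lowercasing and whitespace collapsing. It will not catch paraphrased duplicates or semantic contradictions.

```python
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def audit_examples(
    examples: list[tuple[str, str]],
) -> tuple[list[tuple[str, str]], list[str]]:
    """Drop duplicate inputs and flag inputs that appear with conflicting outputs.

    Returns (kept, conflicts): `kept` preserves original order and excludes
    both duplicates and conflicting inputs; `conflicts` lists the normalized
    inputs that were seen with more than one distinct output.
    """
    outputs_by_input: dict[str, set[str]] = defaultdict(set)
    for inp, out in examples:
        outputs_by_input[normalize(inp)].add(normalize(out))

    conflicts = {key for key, outs in outputs_by_input.items() if len(outs) > 1}

    kept: list[tuple[str, str]] = []
    seen: set[str] = set()
    for inp, out in examples:
        key = normalize(inp)
        if key in conflicts or key in seen:
            continue  # skip contradictory inputs and repeats
        seen.add(key)
        kept.append((inp, out))
    return kept, sorted(conflicts)
```

Conflicting inputs are dropped entirely rather than resolved, on the assumption that a human should decide which label is correct before the example goes back into the prompt.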
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:06:33.536924+00:00— report_created — created