Report #44307
[counterintuitive] Why does adding more few-shot examples stop improving performance or even make it worse?
Use 3-5 well-chosen, diverse examples rather than maximizing example count. If you need more than ~5-10 examples for a task, switch to fine-tuning or retrieval-augmented generation. Focus examples on demonstrating the output format and edge cases, not on volume.
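As a rough illustration of "few, diverse examples", the sketch below picks a small set from a larger pool and formats a prompt. The helper names (`pick_diverse`, `build_prompt`), the `Input:`/`Output:` layout, and word-overlap (Jaccard) similarity as the diversity measure are all assumptions made for this example. A real setup would more likely compare embeddings.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected output); hypothetical shape


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a crude stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pick_diverse(pool: List[Example], k: int = 4) -> List[Example]:
    """Greedy max-min selection: seed with the first example, then repeatedly
    add the candidate that is least similar to anything already chosen."""
    if not pool:
        return []
    chosen = [pool[0]]
    remaining = list(pool[1:])
    while remaining and len(chosen) < k:
        best = min(
            remaining,
            key=lambda cand: max(jaccard(cand[0], c[0]) for c in chosen),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def build_prompt(examples: List[Example], query: str) -> str:
    """Format the demonstrations and the query with one consistent layout."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


pool = [
    ("the movie was great", "positive"),
    ("the movie was good", "positive"),
    ("terrible service and rude staff", "negative"),
    ("the movie was fine", "positive"),
]
picked = pick_diverse(pool, k=2)
prompt = build_prompt(picked, "the food was cold")
```

Capping `k` keeps the prompt short, and the max-min step avoids spending the budget on near-duplicates such as the three "the movie was ..." inputs above.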
Journey Context:
Developers often assume that more few-shot examples mean better in-context learning (ICL). But Min et al. (2022) showed that replacing demonstration labels with random labels barely hurts performance on many classification and multiple-choice tasks: models pick up mainly the format and label space from demonstrations, not the input-label mapping. As a result, after the first 2-3 examples each additional example adds little format signal while still consuming context window. Beyond a threshold, more examples can hurt by: (1) pushing the model toward surface pattern-matching on example artifacts, (2) consuming context that could be used for the actual task, (3) diluting attention, so the model spreads attention across many examples instead of deeply processing the query. The counterintuitive finding is that random-label examples perform nearly as well as correct-label examples on those tasks, which suggests that ICL is not 'learning from examples' in the way developers assume; it is largely format specification with minor signal extraction.
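To check whether this holds for your own task, one option is to build two prompts that differ only in their demonstration labels and compare model accuracy on each. The sketch below prepares such a pair. It is loosely modeled on the gold-versus-random comparison described above, not on the authors' code, and the function names, prompt layout, and seed handling are assumptions. The model call itself is left out.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, label); hypothetical shape


def with_random_labels(
    examples: List[Example], label_space: Sequence[str], seed: int = 0
) -> List[Example]:
    """Keep each input but draw its label uniformly from the label space."""
    rng = random.Random(seed)  # local RNG so the ablation is reproducible
    labels = list(label_space)
    return [(text, rng.choice(labels)) for text, _ in examples]


def format_demos(examples: List[Example], query: str) -> str:
    """Same layout for both conditions, so only the labels differ."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def ablation_prompts(
    examples: List[Example], label_space: Sequence[str], query: str, seed: int = 0
) -> Dict[str, str]:
    """Return a gold-label prompt and a random-label prompt for one query."""
    shuffled = with_random_labels(examples, label_space, seed)
    return {
        "gold": format_demos(examples, query),
        "random": format_demos(shuffled, query),
    }


demos = [
    ("great acting and a tight script", "positive"),
    ("two hours I will never get back", "negative"),
    ("solid, if unremarkable", "positive"),
]
prompts = ablation_prompts(demos, ["positive", "negative"], "a waste of money")
```

If accuracy under `prompts["random"]` stays close to accuracy under `prompts["gold"]` across a held-out set, the demonstrations are mostly conveying format and label space, and adding more of them is unlikely to help.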
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:50:18.205018+00:00— report_created — created