Report #29548
[counterintuitive] More few-shot examples do not always improve performance
Start with zero-shot and add examples only when the output falls short. Keep examples consistent with each other; contradictory examples confuse the model. Two to three high-quality, diverse examples often outperform ten or more. If the model's baseline behavior already matches your goal, adding examples can override that learned behavior and make results worse.
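The "start with zero-shot, add only when needed" advice can be sketched as a small selection loop. This is a minimal illustration, not a reference implementation: `model` is a hypothetical callable that takes a prompt string and returns the completion text, the `Input:`/`Output:` layout is an arbitrary choice, and `max_shots` and `min_gain` are made-up thresholds you would tune on your own dev set.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected output)


def build_prompt(instruction: str, shots: Sequence[Example], query: str) -> str:
    """Assemble a prompt; with no shots this is a plain zero-shot prompt."""
    parts = [instruction]
    for inp, out in shots:
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def accuracy(model: Callable[[str], str], instruction: str,
             shots: Sequence[Example], dev_set: Sequence[Example]) -> float:
    """Exact-match accuracy of `model` on a small labeled dev set."""
    if not dev_set:
        raise ValueError("dev_set must not be empty")
    correct = sum(
        model(build_prompt(instruction, shots, query)).strip() == expected
        for query, expected in dev_set
    )
    return correct / len(dev_set)


def pick_shot_count(model: Callable[[str], str], instruction: str,
                    candidates: Sequence[Example], dev_set: Sequence[Example],
                    max_shots: int = 3, min_gain: float = 0.02) -> tuple[int, float]:
    """Start at zero shots and accept a larger shot count only if it
    beats the best accuracy so far by at least `min_gain`."""
    best_k = 0
    best_acc = accuracy(model, instruction, [], dev_set)
    for k in range(1, min(max_shots, len(candidates)) + 1):
        acc = accuracy(model, instruction, candidates[:k], dev_set)
        if acc >= best_acc + min_gain:
            best_k, best_acc = k, acc
    return best_k, best_acc
```

If the zero-shot baseline is already good, the loop returns 0 and no examples are added. Keep the dev set separate from the candidate examples; otherwise the measured gain is inflated.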
Journey Context:
Few-shot examples shift the model's output distribution toward the pattern in the examples, but this is a double-edged sword: (1) examples that contradict the model's pre-training can degrade performance on edge cases not covered by the examples, (2) too many examples consume context window and dilute attention on the actual query, (3) inconsistent examples (different formats, conflicting patterns) teach the model an ambiguous distribution, (4) examples can anchor the model to the specific patterns shown, reducing generalization. The GPT-3 paper itself demonstrated non-monotonic few-shot scaling where performance could dip at certain example counts before recovering. For coding agents: if the model already knows how to write Python, showing it five examples of Python code might reduce its performance by anchoring it to a specific style that doesn't fit the current task.
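Some of the failure modes above (conflicting patterns, mixed formats, context consumption) can be caught mechanically before a prompt is sent. The following is a rough sketch under stated assumptions: `lint_examples` is a hypothetical helper, the "JSON vs plain text" test is a deliberately crude format heuristic, and whitespace-split word counts stand in for real token counts, which depend on the model's tokenizer.

```python
from collections import defaultdict
from typing import Sequence

Example = tuple[str, str]  # (input text, expected output)


def lint_examples(examples: Sequence[Example],
                  max_example_words: int = 200) -> list[str]:
    """Return human-readable warnings about a few-shot example set."""
    warnings = []

    # Contradictory examples: the same (normalized) input mapped to
    # more than one distinct output.
    outputs_by_input = defaultdict(set)
    for inp, out in examples:
        outputs_by_input[inp.strip().lower()].add(out.strip())
    for inp, outs in outputs_by_input.items():
        if len(outs) > 1:
            warnings.append(f"conflicting outputs for {inp!r}: {sorted(outs)}")

    # Inconsistent format: crude check on whether outputs look like
    # JSON objects or plain text (an assumption; adapt to your format).
    shapes = {"json" if out.strip().startswith("{") else "text"
              for _, out in examples}
    if len(shapes) > 1:
        warnings.append("mixed output formats: " + ", ".join(sorted(shapes)))

    # Context consumption: word count as a rough proxy for tokens.
    words = sum(len(inp.split()) + len(out.split()) for inp, out in examples)
    if words > max_example_words:
        warnings.append(f"examples use about {words} words "
                        f"(budget {max_example_words})")

    return warnings
```

A check like this does not detect anchoring or degraded edge-case behavior; those still need evaluation against held-out queries.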
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T03:59:03.733440+00:00— report_created — created