Report #47835
[cost_intel] Few-shot examples silently 10x-ing API costs with minimal quality gain beyond 2-3 examples
Run an A/B test that systematically reduces the number of few-shot examples. For most classification and extraction tasks, quality plateaus at 2-3 examples. Remove examples from the bottom of the list one at a time and measure quality after each removal. If you need more than 5 examples for stable quality, that's a signal to fine-tune instead.
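A minimal sketch of the "remove from the bottom and measure" loop, in Python. The `evaluate` callable is a hypothetical placeholder for whatever quality metric you already have (for example, accuracy on a held-out labeled set), and the `tolerance` default is an arbitrary assumption, not a recommended value.

```python
from typing import Callable, Sequence


def smallest_stable_count(
    examples: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    tolerance: float = 0.01,
) -> int:
    """Return the smallest number of leading examples whose score stays
    within `tolerance` of the score obtained with the full example list.

    `evaluate` is a hypothetical, user-supplied function: it takes the
    few-shot examples to include in the prompt and returns a quality
    score where higher is better.
    """
    baseline = evaluate(examples)
    best = len(examples)
    # Drop one example from the bottom at a time: n-1, n-2, ..., 0.
    for count in range(len(examples) - 1, -1, -1):
        score = evaluate(examples[:count])
        if baseline - score > tolerance:
            break  # quality dropped too far; keep the previous count
        best = count
    return best
```

Pass your real example list and your own scoring function. Per the guidance above, a returned count above 5 suggests fine-tuning rather than more examples.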
Journey Context:
The common anti-pattern: someone adds 10 examples to improve quality by 2%, not realizing those examples add 2000+ tokens to every single request. At millions of calls, this is thousands of dollars for negligible gain. The math: 10 examples × 200 tokens each = 2000 extra input tokens per call. At $3 per 1M input tokens (Sonnet), that's $0.006 extra per call. At 1M calls/day, that's $6K/day, or roughly $2.2M/year, for a 2% quality bump. The fix is to test systematically: start with 0 examples, add 1, measure, add another, measure. The plateau is almost always at 2-3. If quality keeps climbing with more examples, your prompt instructions are insufficient: fix the instructions, don't pad with examples.
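The arithmetic above can be reproduced with a short Python sketch. The function name is illustrative, and the inputs are the figures quoted in this report (200 tokens per example, $3 per 1M input tokens, 1M calls/day); substitute your own token counts and pricing.

```python
def extra_cost_per_call(
    num_examples: int,
    tokens_per_example: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Extra input cost in USD that few-shot examples add to one request."""
    extra_tokens = num_examples * tokens_per_example
    return extra_tokens * usd_per_million_input_tokens / 1_000_000


# Figures taken from the report; adjust for your own workload.
per_call = extra_cost_per_call(10, 200, 3.0)  # 2000 extra tokens per call
per_day = per_call * 1_000_000                # at 1M calls/day
per_year = per_day * 365

print(f"per call: ${per_call:.3f}")
print(f"per day:  ${per_day:,.0f}")
print(f"per year: ${per_year:,.0f}")
```

Calling the same function with 3 examples instead of 10 gives the cost at the typical plateau for comparison.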
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:46:44.824170+00:00— report_created — created