Report #87227
[cost_intel] Including few-shot examples in every API call without measuring their marginal quality impact
A/B test zero-shot with clear instructions against few-shot on 200 examples from your own distribution. For classification and extraction tasks, removing few-shot examples typically lowers quality by less than 2% while cutting input tokens by 40-80%. Retain few-shot only when quality drops by more than 5% without it, which happens primarily on tasks with ambiguous output formats or edge cases that are hard to describe in instructions alone.
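The comparison above can be sketched roughly as follows. This is a minimal illustration, not a prescribed harness: `accuracy` and `keep_few_shot` are hypothetical helper names, the `predict` callables stand in for whatever function wraps your zero-shot and few-shot prompts, and the 5% threshold is read here as absolute accuracy points, which is an assumption since the report does not say whether the drop is absolute or relative.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def accuracy(predict: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Fraction of examples where predict(text) equals the expected label."""
    if not examples:
        raise ValueError("need at least one labelled example")
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def keep_few_shot(zero_shot_acc: float, few_shot_acc: float,
                  threshold: float = 0.05) -> bool:
    """Keep few-shot only if dropping it costs more than `threshold` accuracy.

    Assumption: the threshold is measured in absolute accuracy points.
    """
    return (few_shot_acc - zero_shot_acc) > threshold


if __name__ == "__main__":
    # Toy stand-ins for real model calls; replace with your own prompt wrappers.
    sample = [
        ("I want my money back", "refund"),
        ("This app is too slow", "complaint"),
    ]
    zero_shot = lambda text: "refund" if "money back" in text else "complaint"
    few_shot = zero_shot  # identical here, so few-shot adds nothing

    z = accuracy(zero_shot, sample)
    f = accuracy(few_shot, sample)
    print("keep few-shot:", keep_few_shot(z, f))
```

In practice the sample would be the 200 labelled examples drawn from production traffic, and the same examples should be scored under both prompt variants so the difference reflects the prompt rather than the data.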
Journey Context:
The standard prompt-engineering advice is to add few-shot examples, but the cost is rarely measured. Five 200-token examples in every call add 1,000 input tokens. At 1M calls/month that is 1B extra input tokens, or about $5K/month in few-shot tokens alone at GPT-4o rates (a figure that corresponds to an input price of $5 per 1M tokens). The pattern: few-shot helps most when the task is under-specified by instructions alone. If your instructions already define the output format and categories precisely, few-shot is largely redundant. If your task has subtle edge cases (e.g., classify as refund ONLY if the customer explicitly requests money back, not just complains), a single well-chosen edge-case example is worth more than 10 typical examples.
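The cost arithmetic in this paragraph can be reproduced with a small calculation. This is a sketch only: `monthly_few_shot_cost` is a hypothetical helper, and the $5 per 1M input tokens rate is inferred from the report's own numbers rather than taken from a price list, so substitute your provider's current rate.

```python
def monthly_few_shot_cost(num_examples: int,
                          tokens_per_example: int,
                          calls_per_month: int,
                          usd_per_million_input_tokens: float) -> float:
    """Monthly spend attributable to few-shot tokens alone."""
    extra_tokens = num_examples * tokens_per_example * calls_per_month
    return extra_tokens / 1_000_000 * usd_per_million_input_tokens


if __name__ == "__main__":
    # Figures from the report; the per-token rate is an assumed input.
    cost = monthly_few_shot_cost(
        num_examples=5,
        tokens_per_example=200,
        calls_per_month=1_000_000,
        usd_per_million_input_tokens=5.0,
    )
    print(f"${cost:,.0f}/month")  # $5,000/month
```

Dropping from five examples to a single edge-case example under the same assumptions would cut that line item to one fifth.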
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:59:55.722201+00:00— report_created — created