Report #79776
[cost_intel] Bloated few-shot prompts in production making cheap models more expensive than frontier models
Audit token cost per request end-to-end. A Haiku call carrying 3000 tokens of few-shot examples often costs more per quality point than a Sonnet call with a 500-token zero-shot prompt. Instead of resending the examples on every call, cache the few-shot prefix, fine-tune the examples into a small model, or use a stronger model zero-shot.
Journey Context:
The classic trap: developers add 5-10 few-shot examples (300-600 tokens each) to make Haiku perform well, bloating prompts to 3000-6000 tokens. At Haiku's $0.25/M input rate, 3000 extra tokens cost $0.00075/call. Sonnet zero-shot with a 500-token prompt at $3/M costs $0.0015/call. Counting input tokens only, the Haiku approach is just 2x cheaper, not the 12x the per-token prices suggest, and at 6000 tokens of examples the two calls cost the same. Few-shot also has steep diminishing returns: the 6th example adds negligible quality but carries its full marginal token cost.

The deeper insight: few-shot tokens are static input tokens repeated on EVERY call. This is exactly the pattern that prompt caching largely removes (if calls land within the cache TTL) or that fine-tuning internalizes (if volume is high). Decision framework: <5 reuses/hour → use Sonnet zero-shot; 5-100 reuses/hour → prompt caching + Haiku; >100 reuses/hour → fine-tune a small model.

The quality-degradation signature of excessive few-shot: the model overfits to the example patterns instead of understanding the task, producing rigid outputs that mimic the examples' format without genuine comprehension.
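The arithmetic and the reuse thresholds above can be sketched in a few lines of Python. This is a rough illustration, not a billing tool: the rates are the input prices quoted in this report (check current pricing), output tokens are ignored, and the function names and return labels are hypothetical.

```python
# Input prices in USD per million tokens, as quoted in this report.
# Assumption: these are illustrative; verify against current pricing.
HAIKU_INPUT_USD_PER_MTOK = 0.25
SONNET_INPUT_USD_PER_MTOK = 3.00


def input_cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Input-side cost of one call. Output tokens are not counted."""
    return tokens * usd_per_mtok / 1_000_000


def break_even_tokens(cheap_rate: float, strong_rate: float,
                      strong_prompt_tokens: int) -> float:
    """Prompt size at which the cheap model's input cost equals the
    strong model's input cost for a prompt of strong_prompt_tokens."""
    return strong_prompt_tokens * strong_rate / cheap_rate


def recommend(reuses_per_hour: float) -> str:
    """Apply the report's reuse thresholds (hypothetical helper)."""
    if reuses_per_hour < 5:
        return "sonnet-zero-shot"
    if reuses_per_hour <= 100:
        return "haiku-with-prompt-caching"
    return "fine-tune-small-model"


if __name__ == "__main__":
    haiku = input_cost_usd(3000, HAIKU_INPUT_USD_PER_MTOK)
    sonnet = input_cost_usd(500, SONNET_INPUT_USD_PER_MTOK)
    print(f"Haiku few-shot:   ${haiku:.5f}/call")
    print(f"Sonnet zero-shot: ${sonnet:.5f}/call")
    print(f"Break-even prompt size: "
          f"{break_even_tokens(0.25, 3.00, 500):.0f} tokens")
    print(f"At 40 reuses/hour: {recommend(40)}")
```

Feeding logged per-request token counts into `input_cost_usd` in place of the fixed 3000 and 500 is one way to run the audit on real traffic.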
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T16:30:30.467430+00:00— report_created — created