Report #60698
[cost_intel] Few-shot examples inflating token costs 3-8x on every request
Audit your prompt token distribution. If few-shot examples or schema descriptions exceed 500 tokens and you're making more than 10K requests/day, move them into a cached prompt prefix, switch to fine-tuning, or use RAG to retrieve only the relevant examples. A 3K-token few-shot block sent with every request on GPT-4o ($2.50/MTok input) at 100K requests/day is 300M tokens, or $750/day in input tokens alone. Cached at a 90% read discount, that drops to ~$75/day; cached-token discounts vary by provider and model, so check the rate that applies to yours. Fine-tuned GPT-4o-mini with zero few-shot tokens: ~$4.50/day for equivalent quality on structured tasks.
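The arithmetic behind those figures can be reproduced with a short sketch. The function name `daily_input_cost` and the two discount values are illustrative assumptions, not provider-quoted rates; substitute the prices from your own provider's pricing page.

```python
def daily_input_cost(tokens_per_request: int, requests_per_day: int,
                     usd_per_mtok: float) -> float:
    """Daily input-token spend in USD for a fixed block sent with every request."""
    total_mtok = tokens_per_request * requests_per_day / 1_000_000
    return total_mtok * usd_per_mtok


# Figures from the report: 3K-token block, 100K requests/day, $2.50 per MTok.
uncached = daily_input_cost(3_000, 100_000, 2.50)
print(f"uncached: ${uncached:.2f}/day")

# Assumed cached-read discounts; real discounts depend on provider and model.
for discount in (0.90, 0.50):
    cached = daily_input_cost(3_000, 100_000, 2.50 * (1 - discount))
    print(f"cached at {discount:.0%} discount: ${cached:.2f}/day")
```

The ~$75/day figure corresponds to the 90% case. At a 50% discount the same block would still cost about $375/day, which is why the applicable rate is worth confirming before choosing between caching and fine-tuning.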
Journey Context:
Token bloat from few-shot examples is one of the most common silent cost multipliers in production LLM pipelines. Engineers add examples to improve quality (which works), then never revisit the cost impact. The pattern: start with 2 examples, discover edge cases, add 3 more, add schema documentation, add error-handling instructions, and suddenly your 200-token task has a 4K-token chaperone. The fix isn't to remove examples (quality drops), but to change the economics: prompt caching makes repeated prefixes far cheaper, fine-tuning bakes the pattern into the model weights, and RAG retrieves only the 1-2 most relevant examples per query. Measure your input:output token ratio. If it exceeds 10:1, you have a bloat problem.
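A minimal sketch of two of the ideas above: checking the input:output ratio against the 10:1 threshold, and picking only the most relevant examples per query. The function names are hypothetical, and the word-overlap (Jaccard) scoring is a dependency-free stand-in for the embedding similarity a real RAG setup would more likely use.

```python
import re


def input_output_ratio(input_tokens: int, output_tokens: int) -> float:
    """Input:output token ratio; the report treats anything above 10:1 as bloat."""
    return input_tokens / max(output_tokens, 1)


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_examples(query: str, examples: list[str], k: int = 2) -> list[str]:
    """Return the k examples with the highest word overlap with the query."""
    q = _words(query)

    def score(example: str) -> float:
        e = _words(example)
        union = q | e
        return len(q & e) / len(union) if union else 0.0

    return sorted(examples, key=score, reverse=True)[:k]


# Hypothetical example pool; a real pool would hold full input/output pairs.
examples = [
    "Extract the invoice total from: 'Total due: $41.20'",
    "Extract the shipping date from: 'Ships on 2024-03-01'",
    "Classify the sentiment of: 'The product broke in a week'",
]
picked = select_examples("Extract the invoice total from this receipt", examples, k=1)
print(picked)

ratio = input_output_ratio(4_200, 150)
print(f"input:output = {ratio:.0f}:1, bloat = {ratio > 10}")
```

Sending one or two selected examples instead of the whole pool keeps the per-request prefix small, at the cost of a retrieval step and a prefix that no longer caches as a single fixed block.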
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:22:00.442681+00:00— report_created — created