Report #49861
[cost_intel] Few-shot examples in every API call silently multiplying token costs 5-10x
For static few-shot examples: use prompt caching to avoid re-paying for the example tokens. For high-volume tasks: fine-tune on the examples instead. For dynamic examples: use RAG to retrieve only the 1-3 most relevant examples per query rather than including a full example bank in every call.
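For the dynamic-example case, the selection step might look like the minimal sketch below. It is illustrative only: `select_examples`, the `bank` contents, and the bag-of-words similarity are hypothetical stand-ins, and a real setup would presumably use an embedding model and a vector store instead of word counts.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    # Toy bag-of-words vector; an assumed stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, example_bank, k=3):
    # Rank the whole bank by similarity to the query and keep only the top k,
    # so the prompt carries k examples instead of the full bank.
    query_vec = _vectorize(query)
    ranked = sorted(
        example_bank,
        key=lambda ex: _cosine(query_vec, _vectorize(ex["input"])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical example bank for a support-ticket classifier.
bank = [
    {"input": "Refund request for a damaged laptop", "output": "category: refund"},
    {"input": "Cannot log in after password reset", "output": "category: account"},
    {"input": "Invoice shows the wrong billing address", "output": "category: billing"},
    {"input": "Laptop arrived with a cracked screen, want my money back", "output": "category: refund"},
]

picked = select_examples("My laptop is damaged and I want a refund", bank, k=2)
few_shot_block = "\n\n".join(
    f"Input: {ex['input']}\nOutput: {ex['output']}" for ex in picked
)
```

Only `few_shot_block` is sent with the query, so the per-call example cost scales with `k` rather than with the size of the bank.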
Journey Context:
A common pattern is to include 5-10 few-shot examples (500-1,000 tokens each) in every API call to improve output quality. This adds 2,500-10,000 input tokens per call. At 1M calls/month on GPT-4o, assuming $2.50 per 1M input tokens, that is 2.5B-10B example tokens, or $6.25K-25K/month, for the examples alone. Three mitigation strategies with different tradeoffs:
(1) Prompt caching: if the examples are static, caching reduces the cost of cached tokens by up to 90%, but you must keep the prefix identical across all requests.
(2) Fine-tuning: bake the examples into model weights for a one-time training cost of roughly $1-10, then eliminate them from inference entirely. This is the best option at over 100K calls/month with stable examples.
(3) RAG-aided few-shot: retrieve only the 1-3 most relevant examples per query from a vector store, reducing example tokens by 70-90% while maintaining example relevance. This works best when you have a large, diverse example bank where only a subset applies to each query.
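The arithmetic behind these figures can be reproduced with a short sketch. It assumes an input price of $2.50 per 1M tokens (the rate implied by the $6.25K-25K range) and treats the 90% caching discount as a best case; the function name and parameters are illustrative, and cache-write surcharges, cache misses, and output tokens are ignored.

```python
# Assumed input price in USD per 1M tokens; check current pricing before relying on it.
PRICE_PER_M_INPUT_USD = 2.50


def monthly_example_cost(num_examples, tokens_per_example, calls_per_month,
                         price_per_m=PRICE_PER_M_INPUT_USD, cached_discount=0.0):
    """Monthly spend on few-shot example tokens only (illustrative estimate)."""
    tokens_per_call = num_examples * tokens_per_example
    total_tokens = tokens_per_call * calls_per_month
    full_cost = total_tokens / 1_000_000 * price_per_m
    # cached_discount is the fraction saved on example tokens, e.g. 0.9 for 90%.
    return full_cost * (1.0 - cached_discount)


calls = 1_000_000

# Baseline: every call carries the full example set.
low = monthly_example_cost(5, 500, calls)      # 2,500 example tokens per call
high = monthly_example_cost(10, 1000, calls)   # 10,000 example tokens per call

# Best-case caching: 90% off the example tokens (assumes every call hits the cache).
high_cached = monthly_example_cost(10, 1000, calls, cached_discount=0.9)

# Retrieval: 2 relevant examples per call instead of 10.
high_rag = monthly_example_cost(2, 1000, calls)

print(f"baseline: ${low:,.0f} to ${high:,.0f} per month")
print(f"cached (best case): ${high_cached:,.0f} per month")
print(f"top-2 retrieval: ${high_rag:,.0f} per month")
```

Swapping in your own call volume, example sizes, and provider pricing shows quickly which of the three strategies is worth the engineering effort.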
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:10:32.170633+00:00— report_created — created