Report #44157
[cost_intel] Few-shot examples in system prompt silently inflating token costs 5-10x across high-volume pipelines
Audit your per-call token breakdown. If a system/instruction prompt carries 3+ few-shot examples that are repeated on every call, you are likely spending 60-80% of your input token budget on static examples. Fix with: (1) prompt caching if the examples are shared across calls, (2) fine-tuning if you have >1K examples and >10K projected calls, or (3) dynamic example retrieval, which includes only the 1-2 most relevant examples per query instead of all examples for every query.
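Fix (3) can be sketched roughly as follows. This is a minimal illustration, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, and `select_examples`, `build_prompt`, and the sample `EXAMPLES` are hypothetical names and data invented for this sketch.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query: str, examples: list, k: int = 2) -> list:
    """Return the k examples whose input text is most similar to the query."""
    q = embed(query)
    ranked = sorted(
        examples,
        key=lambda ex: cosine(q, embed(ex["input"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, query: str, examples: list, k: int = 2) -> str:
    """Assemble a prompt with only the top-k examples instead of all of them."""
    parts = [instruction]
    for ex in select_examples(query, examples, k):
        parts.append(f"Input: {ex['input']}\nLabel: {ex['label']}")
    parts.append(f"Input: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical example pool; a real pipeline would precompute and store embeddings.
EXAMPLES = [
    {"input": "refund not received for my order", "label": "billing"},
    {"input": "app crashes when I open settings", "label": "bug"},
    {"input": "how do I change my password", "label": "account"},
    {"input": "charged twice for the same order", "label": "billing"},
]

if __name__ == "__main__":
    print(build_prompt("Classify the support ticket.",
                       "I was charged twice for one order", EXAMPLES))
```

In practice the example embeddings would be computed once and cached, so the per-call cost is one query embedding plus a similarity lookup.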
Journey Context:
A concrete example: a classification pipeline with 8 few-shot examples at 250 tokens each carries 2,000 tokens of examples, plus a 300-token instruction and a 100-token user input. That is 2,000/2,400 = 83% of input tokens spent on examples that are identical across calls. At 1M calls/month on GPT-4o, that is 2B tokens on examples alone, or ~$10K/month on static text (this figure implies about $5 per 1M input tokens; scale it to your actual rate). Prompt caching reduces the cost of the cached portion by up to ~90%, depending on the provider's cached-token discount. Fine-tuning on those 8 examples (plus more training data) eliminates the overhead entirely: a fine-tuned GPT-4o-mini or Haiku can match the few-shot GPT-4o quality at roughly 1/20th the per-call cost. Dynamic retrieval (embedding the examples and pulling the top 2 by similarity) cuts the example overhead by 75% (2 of 8 examples, 500 tokens instead of 2,000), with minimal quality impact. The signature of this problem: input token counts that are 5-10x output token counts on classification/extraction tasks.
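The arithmetic in this example can be reproduced with a short script, which also makes it easy to plug in your own numbers. This is a sketch under stated assumptions: the function names are hypothetical, and the $5 per 1M input tokens rate is inferred from the report's own figures (2B tokens for ~$10K), not taken from a price list.

```python
def input_breakdown(n_examples: int, tokens_per_example: int,
                    instruction_tokens: int, user_tokens: int) -> dict:
    """Split per-call input tokens into static examples vs. everything else."""
    example_tokens = n_examples * tokens_per_example
    total = example_tokens + instruction_tokens + user_tokens
    return {
        "example_tokens": example_tokens,
        "total_input_tokens": total,
        "example_share": example_tokens / total,
    }


def monthly_example_cost(example_tokens: int, calls_per_month: int,
                         usd_per_million_input: float,
                         discount: float = 0.0) -> float:
    """Monthly spend on example tokens; `discount` models a cached-token discount."""
    full = example_tokens * calls_per_month / 1_000_000 * usd_per_million_input
    return full * (1.0 - discount)


def retrieval_reduction(total_examples: int, kept_examples: int) -> float:
    """Fraction of example tokens removed when keeping only top-k examples."""
    return 1.0 - kept_examples / total_examples


if __name__ == "__main__":
    b = input_breakdown(8, 250, 300, 100)
    print(f"example share of input: {b['example_share']:.0%}")
    # Assumed rate: $5 per 1M input tokens (implied by the report, not quoted).
    print(f"monthly cost of examples: ${monthly_example_cost(2000, 1_000_000, 5.0):,.0f}")
    print(f"reduction with top-2 retrieval: {retrieval_reduction(8, 2):.0%}")
```

Passing `discount=0.9` to `monthly_example_cost` models the ~90% caching reduction mentioned above; the discount that actually applies depends on the provider and model.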
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:35:15.869937+00:00— report_created — created