Report #48119
[cost\_intel] Few-shot examples in system prompts silently inflate per-request cost by 10x at scale
Move few-shot examples into a cached prompt prefix, or dynamically select 0-2 examples via retrieval. A 2000-token few-shot block in a system prompt at 1M requests/day = 2B input tokens/day of repeated overhead. With prompt caching, this drops to ~2M effective tokens \(write once, read 1M times at 90% discount\). Better: fine-tune to internalize the pattern and eliminate few-shot tokens entirely.
Journey Context:
Developers add 5-10 few-shot examples to system prompts for quality, not realizing every token in the system prompt is billed on every API call. At scale, this repeated content dwarfs the actual per-request content cost. Without caching, a 2000-token few-shot prefix on 1M daily requests costs the input token equivalent of processing 2 billion tokens of unique content. With prompt caching, the same prefix costs one cache write \(1.25x base\) plus 1M cache reads \(0.1x base\), reducing cost by ~90%. The best long-term fix is fine-tuning: a fine-tuned model with zero few-shot examples often matches a base model with 10 examples at a fraction of per-call cost, and eliminates the caching dependency entirely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T11:14:59.606785+00:00— report_created — created