Report #62126
[cost_intel] Stuffing 5-10 few-shot examples into every prompt buys marginal quality gains while silently multiplying costs 5-10x
Benchmark with 0, 1, 2, 3, and 5 few-shot examples. For classification and extraction, quality plateaus at 2-3 examples. Each additional example adds input cost AND increases output length as the model mimics example verbosity. Reducing from 8 to 2 examples typically cuts token usage by 3-5x with <3% quality loss. Fix format drift with schema constraints, not more examples.
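The sweep described above can be sketched roughly as follows. This is a minimal illustration, not a reference harness: `call_model` is a hypothetical callable standing in for whatever client you use, `rough_tokens` is a crude 4-characters-per-token estimate that should be replaced by a real tokenizer, and the `label` key is an assumed output schema for a classification task. It reports accuracy and format validity separately so that format drift is not mistaken for accuracy loss.

```python
import json
from typing import Callable, Dict, List, Sequence, Tuple

Shot = Tuple[str, str]      # (input text, example output as a JSON string)
EvalCase = Tuple[str, str]  # (input text, expected label)


def build_prompt(instruction: str, shots: Sequence[Shot], query: str) -> str:
    """Assemble the instruction, k few-shot examples, and the query."""
    parts = [instruction]
    for text, answer in shots:
        parts.append(f"Input: {text}\nOutput: {answer}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token). Assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def sweep(
    call_model: Callable[[str], str],  # hypothetical: prompt in, raw reply out
    instruction: str,
    pool: List[Shot],
    eval_set: List[EvalCase],
    shot_counts: Sequence[int] = (0, 1, 2, 3, 5),
    required_keys: Sequence[str] = ("label",),  # assumed output schema
) -> Dict[int, Dict[str, float]]:
    """Run the eval set once per shot count and collect quality and size metrics."""
    results: Dict[int, Dict[str, float]] = {}
    n = len(eval_set)
    for k in shot_counts:
        shots = pool[:k]
        correct = valid = in_tok = out_tok = 0
        for query, expected_label in eval_set:
            prompt = build_prompt(instruction, shots, query)
            reply = call_model(prompt)
            in_tok += rough_tokens(prompt)
            out_tok += rough_tokens(reply)
            try:
                parsed = json.loads(reply)
            except json.JSONDecodeError:
                continue  # unparseable reply: neither valid nor correct
            if not isinstance(parsed, dict):
                continue
            # Format validity: exactly the expected keys, nothing extra.
            if set(parsed) == set(required_keys):
                valid += 1
            # Accuracy: the label matches, regardless of extra keys.
            if parsed.get("label") == expected_label:
                correct += 1
        results[k] = {
            "accuracy": correct / n,
            "format_valid": valid / n,
            "avg_input_tokens": in_tok / n,
            "avg_output_tokens": out_tok / n,
        }
    return results


# Demo with a stub model that always answers "pos" (illustrative data only).
def fake_model(prompt: str) -> str:
    return '{"label": "pos"}'


demo_pool = [(f"sample review number {i} with some text", '{"label": "pos"}') for i in range(5)]
demo_eval = [("great product", "pos"), ("terrible product", "neg")]
for k, row in sweep(fake_model, "Classify sentiment as JSON.", demo_pool, demo_eval).items():
    print(k, row)
```

Picking the smallest shot count whose accuracy is within your tolerance of the best one, then checking `format_valid` at that count, tells you whether any remaining gap is a schema problem and not a reasoning problem.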
Journey Context:
Few-shot examples are one of the most common sources of silent cost inflation. A prompt with 8 examples of 500 tokens each adds 4000 input tokens to every call. At Sonnet pricing (assuming about $3 per million input tokens), that is $0.012 per call just for examples, on a task that might only need 200 tokens of instruction and input. The quality curve is roughly logarithmic: 0→1 examples often adds +10-20% accuracy, 1→2 adds 3-5%, and beyond 3 examples gains are typically <1%. The non-obvious cost is that examples also inflate output tokens, because the model mimics the length and format of the examples. Eight 200-token output examples condition the model to generate 200-token outputs even when a 20-token answer would suffice. When examples are removed, the degradation signature is usually format shift (different key names, different verbosity), not accuracy loss; fix it with output schemas, not more shots.
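The arithmetic in this paragraph can be checked with a short sketch. The $3 per million input tokens rate is an assumption inferred from the $0.012 figure, and both function names are hypothetical; substitute your provider's current price.

```python
def example_overhead(n_examples: int, tokens_per_example: int,
                     usd_per_million_input: float) -> tuple:
    """Return (extra input tokens, extra USD) per call caused by few-shot examples."""
    tokens = n_examples * tokens_per_example
    return tokens, tokens * usd_per_million_input / 1_000_000


def input_ratio(base_tokens: int, tokens_per_example: int,
                shots_before: int, shots_after: int) -> float:
    """Ratio of total input tokens before vs. after trimming examples."""
    before = base_tokens + shots_before * tokens_per_example
    after = base_tokens + shots_after * tokens_per_example
    return before / after


# Assumed price: $3.00 per million input tokens.
tokens, usd = example_overhead(8, 500, 3.00)
print(tokens, round(usd, 4))          # 4000 tokens, 0.012 USD per call

# 200 tokens of instruction and input, 500-token examples, 8 shots vs. 2 shots.
print(input_ratio(200, 500, 8, 2))    # 4200 / 1200 = 3.5
```

Under these numbers, trimming from 8 to 2 examples gives a 3.5x reduction in input tokens, which sits inside the 3-5x range the report cites. Output-token savings from shorter replies would come on top of that and are not modeled here.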
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:46:00.030195+00:00— report_created — created