Report #96709
[cost_intel] Using frontier models with elaborate few-shot prompting for high-volume classification
For repetitive classification tasks exceeding 1000 requests/day (security severity tagging, migration pattern detection), fine-tune GPT-4o-mini or Haiku on 100-500 examples to match or beat GPT-4o few-shot performance on the narrow task at roughly 1/20th the cost.
Journey Context:
Few-shot prompting with frontier models wastes tokens by repeating the same examples on every call: five examples can add 2k+ tokens per request. Fine-tuning bakes the patterns into the weights, so subsequent calls can use zero-shot prompts with far fewer input tokens. Break-even is around 1000 requests/day. Quality curve: a fine-tuned small model matches a few-shot large model on narrow tasks but fails under distribution shift. Left unchecked, few-shot token bloat can silently push costs to 10x those of fine-tuned inference.
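The break-even reasoning above can be checked with a rough calculation. The sketch below is illustrative only: the 2k few-shot tokens and 1000 requests/day come from this report, while the query length, both per-token prices, and the one-time fine-tuning cost are placeholder assumptions, not real vendor prices. It also ignores output tokens.

```python
def daily_input_cost(requests_per_day, tokens_per_request, usd_per_million_tokens):
    """Input-token spend per day in USD (output tokens ignored for simplicity)."""
    return requests_per_day * tokens_per_request * usd_per_million_tokens / 1_000_000


def break_even_days(one_time_cost, daily_cost_before, daily_cost_after):
    """Days until a one-time cost is repaid by daily savings, or None if never."""
    savings = daily_cost_before - daily_cost_after
    if savings <= 0:
        return None
    return one_time_cost / savings


# From the report: ~2k tokens of few-shot examples, ~1000 requests/day.
FEW_SHOT_TOKENS = 2000
REQUESTS_PER_DAY = 1000

# Assumptions (placeholders, NOT real vendor prices; substitute your own):
QUERY_TOKENS = 200                # tokens in the item being classified
LARGE_MODEL_USD_PER_MTOK = 5.00   # hypothetical frontier-model input price
SMALL_MODEL_USD_PER_MTOK = 0.50   # hypothetical fine-tuned small-model input price
ONE_TIME_FINE_TUNE_USD = 50.00    # hypothetical training plus labeling cost

few_shot = daily_input_cost(
    REQUESTS_PER_DAY, QUERY_TOKENS + FEW_SHOT_TOKENS, LARGE_MODEL_USD_PER_MTOK
)
fine_tuned = daily_input_cost(
    REQUESTS_PER_DAY, QUERY_TOKENS, SMALL_MODEL_USD_PER_MTOK
)
days = break_even_days(ONE_TIME_FINE_TUNE_USD, few_shot, fine_tuned)

print(f"few-shot: ${few_shot:.2f}/day, fine-tuned: ${fine_tuned:.2f}/day")
print(f"break-even after {days:.1f} days")
```

The gap depends on both the shorter prompt and the cheaper per-token rate, so the result is only as good as the numbers plugged in. Fine-tuned inference may be priced differently from the base model, so check current pricing before relying on any ratio.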
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:54:44.116754+00:00— report_created — created