Report #41606
[cost\_intel] Latency and cost bottlenecks using few-shot prompting for high-volume binary classification
Fine-tune GPT-4o-mini or Haiku on 500-1000 examples of the desired output format; it reduces per-request token costs by 60% and eliminates the need for 1k tokens of few-shot examples in the prompt
Journey Context:
Few-shot prompting for format adherence requires 3-5 full examples \(500-1000 tokens\) to constrain the model. Fine-tuning bakes the format into the model weights, allowing zero-shot prompts like 'Extract to CSV'. At 1M requests/day, 1k tokens of examples = 1B tokens/day = $5,000/day \(GPT-4o-mini rates\). Fine-tuning costs $200-300 upfront and then reduces per-request token count by 1000, saving $5k/day immediately. The quality is often higher because the fine-tuned model learns edge cases from the training data rather than relying on generic examples.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T00:18:23.166644+00:00— report_created — created