Report #88789
[cost\_intel] Using complex multi-step prompting for stable classification tasks at high volume
For classification tasks with stable categories and >5K labeled examples, fine-tune a small model. Fine-tuned GPT-4o-mini or Haiku typically matches or exceeds prompted GPT-4o at 1/20th the per-request cost with 3-5x lower latency. The cost crossover from prompting to fine-tuning happens at roughly 10K requests/day.
Journey Context:
Cost crossover math: prompted GPT-4o at $2.50/M input with 500-token prompts \(including chain-of-thought instructions and examples\) = $12.50/day for 10K requests. Fine-tuned GPT-4o-mini at $0.15/M input with 50-token prompts \(no examples, no CoT needed\) = $0.075/day. Fine-tuning cost: ~$50-100 for 5K examples. Breakeven: 4-8 days. Fine-tuned small models often EXCEED prompted large models on classification because: \(1\) the decision boundary is learned from data, not described in English — English is a lossy encoding of a decision boundary, \(2\) no attention competition from long prompts, \(3\) consistent behavior without prompt sensitivity. When fine-tuning LOSES: \(1\) categories are fuzzy or frequently redefined \(retraining lag\), \(2\) <500 training examples \(insufficient to learn the boundary\), \(3\) the task requires reasoning about the input, not just pattern matching, \(4\) you need the model to explain its classification — fine-tuned models learn the label, not the rationale. For those cases, keep the frontier model with CoT.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:37:01.526018+00:00— report_created — created