Report #85670
[cost\_intel] Synthetic data generation at scale with reasoning models breaks budget without distillation
Use o1-preview to generate about 10k high-quality reasoning traces \(complex chain-of-thought\), distill them into GPT-4o-mini, then scale the dataset to 100k\+ with the cheaper model. Naive o1 generation costs ~$150 per 1k complex samples versus ~$5 per 1k for GPT-4o-mini, so generating all 100k samples with o1 costs about $15k, compared with about $500 for GPT-4o-mini alone.
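The arithmetic behind these figures can be checked with a short sketch. The per-1k prices are the report's estimates, not official pricing, and the function names and the 10k/90k split are illustrative assumptions. Fine-tuning charges are not modeled.

```python
# Rough cost model using the report's per-1k-sample estimates.
# These are NOT official prices; substitute your own measured costs.
O1_COST_PER_1K = 150.0    # ~$150 per 1k complex samples (report's figure)
MINI_COST_PER_1K = 5.0    # ~$5 per 1k samples (report's figure)


def generation_cost(n_samples: int, cost_per_1k: float) -> float:
    """Cost in dollars of generating n_samples at a flat per-1k rate."""
    return n_samples / 1000 * cost_per_1k


def hybrid_cost(total: int, seed: int) -> float:
    """Teacher generates `seed` samples; the student generates the rest.

    Assumption: any fine-tuning charge for the student is excluded.
    """
    teacher = generation_cost(seed, O1_COST_PER_1K)
    student = generation_cost(total - seed, MINI_COST_PER_1K)
    return teacher + student


naive = generation_cost(100_000, O1_COST_PER_1K)        # all o1
mini_only = generation_cost(100_000, MINI_COST_PER_1K)  # all GPT-4o-mini
hybrid = hybrid_cost(total=100_000, seed=10_000)        # 10k o1 + 90k mini

print(f"naive o1:  ${naive:,.0f}")
print(f"mini only: ${mini_only:,.0f}")
print(f"hybrid:    ${hybrid:,.0f} ({hybrid / naive:.0%} of naive)")
```

Under these assumptions the hybrid run is dominated by the seed set, so shrinking the seed set is the main lever on total cost.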
Journey Context:
Teams building fine-tuning datasets often use the strongest model \(o1\) to generate every training example in order to ensure quality. However, o1 is approximately 30x more expensive per sample than GPT-4o-mini, so for large datasets \(100k\+ examples\) this means thousands of dollars in API costs. The hard-won insight is the 'teacher-student' distillation pattern: use o1 as the 'teacher' to generate a small seed set \(5k-10k\) of high-quality, complex reasoning traces, then use those traces to few-shot prompt or fine-tune a cheap 'student' model \(GPT-4o-mini\) that replicates the reasoning style at scale. The reported result is roughly 90% of the reasoning quality, with each student-generated sample costing about 3% of an o1 sample. End to end, at ~$150 and ~$5 per 1k samples, a 10k seed set plus 90k student samples comes to roughly $1,950 against $15k for o1 alone \(about 13%\), before any fine-tuning charges. The pattern only pays off when the task requires complex reasoning \(math, code\); for simple classification even the teacher model is wasted, and heuristic generation suffices.
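The few-shot variant of the pattern can be sketched as follows. This is a minimal illustration, not the report author's pipeline: `Trace`, `build_few_shot_prompt`, `scale_dataset`, and the prompt layout are hypothetical, and `student` stands for whatever function wraps your call to the cheap model. No real API is invoked here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Trace:
    """One teacher-generated seed example (hypothetical schema)."""
    problem: str
    reasoning: str
    answer: str


def build_few_shot_prompt(
    seeds: Sequence[Trace],
    new_problem: str,
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Sample k seed traces and append the new problem for the student."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible
    shots = rng.sample(list(seeds), k=min(k, len(seeds)))
    parts = [
        f"Problem: {s.problem}\nReasoning: {s.reasoning}\nAnswer: {s.answer}"
        for s in shots
    ]
    parts.append(f"Problem: {new_problem}\nReasoning:")
    return "\n\n".join(parts)


def scale_dataset(
    seeds: Sequence[Trace],
    problems: Sequence[str],
    student: Callable[[str], str],
    k: int = 3,
) -> list:
    """Run the student over new problems, each prompted with teacher traces.

    `student` is an assumed callable: prompt text in, completion text out.
    """
    rng = random.Random(0)
    rows = []
    for problem in problems:
        prompt = build_few_shot_prompt(seeds, problem, k=k, rng=rng)
        rows.append({"problem": problem, "completion": student(prompt)})
    return rows


# Usage with a stub in place of a real model call.
seeds = [
    Trace("2 + 2 * 3", "Multiply first: 2 * 3 = 6, then add 2.", "8"),
    Trace("(4 - 1) * 5", "Evaluate the parentheses: 3, then 3 * 5.", "15"),
]
rows = scale_dataset(seeds, ["7 * 6 - 2"], student=lambda prompt: "stub", k=2)
```

In practice the student's outputs would still need a quality filter, for example checking final answers on problems with known solutions, before they are added to the training set.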
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T02:23:01.524197+00:00— report_created — created