Report #78171
[cost_intel] When is o1-generated synthetic training data worth 20x the cost of GPT-4o data for distilling into smaller models?
Use o1-preview to generate complex reasoning traces (Chain-of-Thought) for math, coding, and logic puzzles when distilling into a small student model such as Llama-3-8B; use GPT-4o for simple instruction-following and factual Q&A datasets.
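The routing rule above can be sketched as a small lookup. This is only an illustration: the task category names and the `choose_teacher` helper are hypothetical, and the returned strings are plain labels, not verified API model identifiers.

```python
# Hypothetical task categories; adjust to match your own dataset taxonomy.
REASONING_TASKS = {"math", "coding", "logic"}
SIMPLE_TASKS = {"instruction_following", "factual_qa"}


def choose_teacher(task_type: str) -> str:
    """Return the teacher label suggested by the report for a task category."""
    task = task_type.strip().lower()
    if task in REASONING_TASKS:
        # Reasoning traces are where the report says the premium pays off.
        return "o1-preview"
    if task in SIMPLE_TASKS:
        # Simple instruction-following and factual Q&A: use the cheaper teacher.
        return "gpt-4o"
    # The report gives no guidance for other categories, so fail loudly.
    raise ValueError(f"no routing rule for task type: {task_type!r}")


if __name__ == "__main__":
    for t in ("math", "factual_qa"):
        print(t, "->", choose_teacher(t))
```

Raising on unknown categories is a deliberate choice here: the report only covers these two groups, so anything else should be decided explicitly rather than defaulted.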
Journey Context:
Research on "Distilling System 2 into System 1" reports that a Llama-3-8B student fine-tuned on o1-generated CoT data reaches 45% accuracy on AIME, whereas the same architecture trained on GPT-4o CoT data reaches only 28%. o1 traces include explicit verification steps and backtracking, which teach the student meta-cognitive strategies. Generating 10k examples costs $500 with o1 versus $25 with GPT-4o (20x), but the o1-trained student outperforms the GPT-4o-trained one by 17 points on reasoning benchmarks. For factual datasets (e.g., "what is the capital of France?"), both teachers produce training data of equivalent quality (<2% gap), so paying the o1 premium there is wasteful.
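To make the trade-off concrete, the sketch below turns the figures quoted above into a marginal cost per benchmark point. The `TeacherOption` type and `marginal_cost_per_point` helper are hypothetical names introduced for illustration, and the inputs are the report's own unverified numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeacherOption:
    name: str
    cost_per_10k_usd: float   # generation cost per 10k examples
    student_score_pct: float  # student benchmark accuracy after distillation


def marginal_cost_per_point(cheap: TeacherOption, premium: TeacherOption,
                            n_examples: int = 10_000) -> float:
    """Extra dollars spent per accuracy point gained by using the premium teacher."""
    scale = n_examples / 10_000
    extra_cost = (premium.cost_per_10k_usd - cheap.cost_per_10k_usd) * scale
    gain = premium.student_score_pct - cheap.student_score_pct
    if gain <= 0:
        # No improvement: the premium is never recovered.
        return float("inf")
    return extra_cost / gain


# Figures as quoted in this report (not independently verified).
gpt4o = TeacherOption("gpt-4o", cost_per_10k_usd=25.0, student_score_pct=28.0)
o1 = TeacherOption("o1-preview", cost_per_10k_usd=500.0, student_score_pct=45.0)

if __name__ == "__main__":
    ratio = o1.cost_per_10k_usd / gpt4o.cost_per_10k_usd
    print(f"cost ratio: {ratio:.0f}x")
    print(f"extra USD per point: {marginal_cost_per_point(gpt4o, o1):.2f}")
```

With these inputs the premium works out to $475 extra for 17 points, roughly $28 per point at 10k examples. When the gap between teachers shrinks toward the <2% reported for factual data, the same formula makes the cost per point climb steeply.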
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:48:26.685600+00:00— report_created — created