Report #62251
[cost\_intel] Using o1-preview for large-scale synthetic data generation to fine-tune smaller models
Use GPT-4o or Mixtral for synthetic data generation; o1 is 30x more expensive and produces 'expert traces' that create a capability gap in student models, leading to hallucinated confidence
Journey Context:
Synthetic data for distillation requires diversity and 'average' difficulty, not optimal reasoning traces. o1 generates 'expert thinking' that includes explicit planning, backtracking, and verification steps. When these traces are used to train smaller models (for example, Llama-3-8B), they create a capability gap: students learn to mimic the format but lack the compute to execute the reasoning, which leads to hallucinated confidence and verbose nonsense. GPT-4o generates more direct solutions that transfer better. Cost analysis: generating 1M synthetic examples costs $45K with o1 versus $1.5K with GPT-4o (a 30x difference), while downstream student accuracy stays within 2% across the two data sources, often favoring the GPT-4o data. Reserve o1 for generating 'hard negative' examples (edge cases) only.
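The cost figures above can be checked with a few lines of arithmetic. The sketch below derives the per-example cost and the 30x ratio from the totals quoted in this report, then estimates a blended budget when only a small share of examples (the hard negatives) comes from o1. The function names and the 5% share are hypothetical, chosen for illustration only; the report does not state what fraction of the data should be hard negatives.

```python
# Totals quoted in the report for 1M synthetic examples.
N_EXAMPLES = 1_000_000
COST_O1_USD = 45_000.0
COST_GPT4O_USD = 1_500.0


def per_example_cost(total_usd: float, n_examples: int) -> float:
    """Average cost of one synthetic example in USD."""
    return total_usd / n_examples


def blended_cost(n_examples: int, o1_fraction: float) -> float:
    """Total cost in USD when `o1_fraction` of the examples come from o1
    and the remainder from GPT-4o (hypothetical helper, not from the report)."""
    if not 0.0 <= o1_fraction <= 1.0:
        raise ValueError("o1_fraction must be between 0 and 1")
    o1_unit = per_example_cost(COST_O1_USD, N_EXAMPLES)
    gpt4o_unit = per_example_cost(COST_GPT4O_USD, N_EXAMPLES)
    n_o1 = n_examples * o1_fraction
    return n_o1 * o1_unit + (n_examples - n_o1) * gpt4o_unit


if __name__ == "__main__":
    ratio = COST_O1_USD / COST_GPT4O_USD
    print(f"o1 per example:     ${per_example_cost(COST_O1_USD, N_EXAMPLES):.4f}")
    print(f"GPT-4o per example: ${per_example_cost(COST_GPT4O_USD, N_EXAMPLES):.4f}")
    print(f"cost ratio:         {ratio:.0f}x")
    # Assumed split: 5% hard negatives from o1, 95% from GPT-4o.
    print(f"blended (5% o1):    ${blended_cost(N_EXAMPLES, 0.05):,.0f}")
```

Under that assumed 5% split, the blended total stays far closer to the GPT-4o-only budget than to the o1-only one, which is the motivation for restricting o1 to edge cases.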
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:58:21.603157+00:00— report_created — created