Report #59205
[cost_intel] Synthetic data generation: when are reasoning models cost-prohibitive despite higher diversity?
Generate base synthetic data with GPT-4o-mini (diversity through high temperature + few-shot); use reasoning models ONLY for 'hard negative' mining or complex reasoning traces (math proofs, debugging steps). Target <5% of the synthetic corpus from reasoning models to stay cost-effective.
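One way to apply the recommendation above is a small routing step that sends only hard-negative and reasoning-trace tasks to the reasoning tier and keeps that tier strictly under the 5% target. This is a minimal sketch, not a definitive implementation: the `Task` type, the kind labels, and the tier names "cheap" and "reasoning" are hypothetical, and no model API is called.

```python
from dataclasses import dataclass

# Hypothetical task labels; the report only names the categories in prose.
REASONING_KINDS = {"hard_negative", "reasoning_trace"}
MAX_REASONING_SHARE = 0.05  # "<5% of the synthetic corpus" from the report


@dataclass
class Task:
    task_id: str
    kind: str  # e.g. "classification", "extraction", "hard_negative"


def assign_tiers(tasks, max_share=MAX_REASONING_SHARE):
    """Map each task_id to "reasoning" or "cheap".

    A task goes to the reasoning tier only if its kind qualifies AND
    doing so keeps the reasoning share strictly below max_share.
    Everything else falls back to the cheap tier.
    """
    total = len(tasks)
    plan = {}
    used = 0
    for task in tasks:
        qualifies = task.kind in REASONING_KINDS
        if qualifies and (used + 1) / total < max_share:
            plan[task.task_id] = "reasoning"
            used += 1
        else:
            plan[task.task_id] = "cheap"
    return plan


if __name__ == "__main__":
    demo = [Task(f"t{i}", "hard_negative" if i < 10 else "classification")
            for i in range(100)]
    plan = assign_tiers(demo)
    share = sum(tier == "reasoning" for tier in plan.values()) / len(demo)
    print(f"reasoning share: {share:.2%}")
```

Tasks that qualify but exceed the cap fall back to the cheap tier here; a real pipeline might instead queue them or rank candidates by difficulty before spending the budget.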
Journey Context:
o1 generates higher-quality reasoning traces but costs $15-60 per 1k examples vs $0.15 for GPT-4o-mini, roughly 100x to 400x more per example. For a pre-training mix, diversity matters more than peak quality. Rule of thumb: if the synthetic data is for supervised fine-tuning (SFT) on pattern-matching tasks (classification, extraction), cheap models suffice. If it is for RLHF 'thought' demonstrations or math/code reasoning traces, pay for reasoning. The crossover depends on data complexity: when examples require more than 5 steps of logical deduction, reasoning models become cost-effective despite the higher unit cost.
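The arithmetic behind the 5% target can be sketched as a blended cost per 1k examples. The prices below are the figures quoted in this report, not verified pricing, and the function name is hypothetical.

```python
# Figures quoted in this report (USD per 1k examples); treat as assumptions.
CHEAP_COST_PER_1K = 0.15
REASONING_COST_PER_1K_RANGE = (15.0, 60.0)


def blended_cost_per_1k(reasoning_share, reasoning_cost_per_1k,
                        cheap_cost_per_1k=CHEAP_COST_PER_1K):
    """Average cost per 1k examples when a fraction of the corpus
    comes from the reasoning model and the rest from the cheap model."""
    if not 0.0 <= reasoning_share <= 1.0:
        raise ValueError("reasoning_share must be between 0 and 1")
    return (reasoning_share * reasoning_cost_per_1k
            + (1.0 - reasoning_share) * cheap_cost_per_1k)


if __name__ == "__main__":
    for share in (0.0, 0.05, 0.25, 1.0):
        low, high = (blended_cost_per_1k(share, c)
                     for c in REASONING_COST_PER_1K_RANGE)
        print(f"share={share:.0%}: ${low:.4f} to ${high:.4f} per 1k examples")
```

Under these figures, a 5% reasoning share puts the blended cost at about $0.89 to $3.14 per 1k examples, roughly 6x to 21x the cheap-only baseline, with most of the spend going to the reasoning tier. That is why the share, rather than the unit price alone, drives the budget.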
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:52:06.553806+00:00— report_created — created