Report #99567
[cost_intel] Running full DeepSeek-R1/o3 for every reasoning task when a distilled small model would suffice
For math, code, and structured reasoning with verifiable answers, use distilled reasoning models such as DeepSeek-R1-Distill-Qwen-14B/32B or DeepSeek-R1-Distill-Llama-8B/70B. They capture much of the parent model's reasoning patterns at roughly 1/50th the inference cost and can run locally. Start with a 14B-32B distill; escalate to full R1/o3 only when accuracy on your hardest cases falls short.
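One way to read the "start small, escalate on failure" advice is as a verify-then-fallback loop. The sketch below is a minimal illustration, not a reference implementation: `small_model`, `large_model`, and `verify` are hypothetical callables you would supply (for example a local distill endpoint, an API reasoning model, and a unit-test runner or answer checker). It assumes the task has a cheap, trustworthy verifier, which is the case the advice above targets.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    answer: str
    model: str      # "small" or "large": which tier produced the answer
    attempts: int   # total model calls made for this task


def solve_with_escalation(
    task: str,
    small_model: Callable[[str], str],   # hypothetical: wraps a distilled model
    large_model: Callable[[str], str],   # hypothetical: wraps full R1/o3
    verify: Callable[[str, str], bool],  # hypothetical: checks (task, answer)
    small_attempts: int = 2,
) -> Result:
    """Try the distilled model first; call the large model only if every
    small-model attempt fails verification."""
    for attempt in range(1, small_attempts + 1):
        answer = small_model(task)
        if verify(task, answer):
            return Result(answer, "small", attempt)
    # The large model's answer is returned unverified in this sketch;
    # a real pipeline would probably verify it too and surface failures.
    return Result(large_model(task), "large", small_attempts + 1)
```

Logging which tasks reach the `"large"` branch gives you the "hardest cases" set to measure before deciding whether a bigger distill or the full model is needed.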
Journey Context:
DeepSeek-R1's distillation experiments show that fine-tuning small base models on 800k reasoning traces transfers reasoning patterns more efficiently than training small models with RL from scratch. DeepSeek-R1-Distill-Qwen-32B scored 72.6% on AIME 2024 (pass@1), 62.1% on GPQA Diamond, and 57.2% on LiveCodeBench, competitive with much larger models, while fitting on commodity GPUs. The failure mode is using a 1.5B distill for competition-level math; the sweet spot is 14B-32B for agent coding and math subtasks. The cost difference versus API reasoning models is 10-100x, but you trade API convenience for hosting and quantization work.
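To see how much of the cost gap survives once some tasks still escalate, a rough expected-cost calculation helps. The sketch below assumes every task is tried once on the small model and a fraction `escalation_rate` is then re-run on the large model. The function names and unit costs are illustrative assumptions, and hosting and engineering overhead are deliberately left out.

```python
def blended_cost(small_cost: float, large_cost: float, escalation_rate: float) -> float:
    """Expected per-task cost: every task pays for the small model,
    and the escalated fraction also pays for the large model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return small_cost + escalation_rate * large_cost


def savings_vs_large_only(small_cost: float, large_cost: float, escalation_rate: float) -> float:
    """Fraction of spend saved compared with sending every task to the large model."""
    return 1.0 - blended_cost(small_cost, large_cost, escalation_rate) / large_cost


def break_even_escalation_rate(small_cost: float, large_cost: float) -> float:
    """Escalation rate above which the two-tier setup costs more than large-only."""
    return 1.0 - small_cost / large_cost


# Illustrative units only: small = 1, large = 50 (the ~1/50th ratio quoted above).
saving = savings_vs_large_only(1.0, 50.0, 0.10)
```

With those illustrative units and a 10% escalation rate, the blended cost is 6 units against 50, a saving of about 88%. The break-even escalation rate is 98%, so under these assumptions the two-tier setup loses only if almost everything escalates.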
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:21:26.944084+00:00— report_created — created