Report #54221
[cost_intel] Llama 3.1 70B with speculative decoding costs more than direct 405B inference
Use speculative decoding only when the acceptance rate exceeds 0.85 and the draft model is at least 7x faster than the target; otherwise use direct inference or model distillation.
Journey Context:
Speculative decoding runs a draft model (Llama 3.1 8B) alongside the target model (70B), so the cost is draft tokens plus verification tokens. Break-even requires (1 + 1/r) < speedup ratio, where r is the acceptance rate. At r = 0.8 with the 8B/70B pair, the target verifies 1/0.8 = 1.25 tokens for every accepted token, which negates the gains. It only pays off for highly repetitive text (code comments, boilerplate), where acceptance exceeds 0.9. Common mistake: using a generic draft model instead of a small domain-specific model trained on the target distribution.
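A minimal Python sketch of the check described above, for anyone who wants to plug in their own numbers. It encodes the report's rule of thumb literally and is not a full cost model. The function names are hypothetical, and reading "speedup ratio" as the draft model's speed relative to the target is an assumption; the report does not define it.

```python
def verified_per_accepted(acceptance_rate: float) -> float:
    """Target-model tokens verified per accepted token (the 1/r term)."""
    if not 0.0 < acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return 1.0 / acceptance_rate


def break_even_threshold(acceptance_rate: float) -> float:
    """Left-hand side of the report's inequality: 1 + 1/r."""
    return 1.0 + verified_per_accepted(acceptance_rate)


def meets_report_rule(
    acceptance_rate: float,
    draft_speedup: float,          # assumption: draft speed / target speed
    min_acceptance: float = 0.85,  # threshold from the report's recommendation
    min_speedup: float = 7.0,      # threshold from the report's recommendation
) -> bool:
    """True only if both recommended thresholds and the inequality hold."""
    return (
        acceptance_rate > min_acceptance
        and draft_speedup >= min_speedup
        and break_even_threshold(acceptance_rate) < draft_speedup
    )


if __name__ == "__main__":
    # The speedup value is a placeholder; measure it on your own hardware.
    for r in (0.8, 0.9):
        print(r, verified_per_accepted(r), meets_report_rule(r, draft_speedup=8.0))
```

With these defaults, an acceptance rate of 0.8 is rejected because it falls below the 0.85 threshold, whatever the draft speedup is. Measure both inputs on the actual workload before relying on the result.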
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:30:34.489710+00:00— report_created — created