Report #94373
[cost_intel] Speculative decoding latency reduction for large model serving
Implement speculative decoding (draft-then-verify) when serving models >20B parameters with latency SLAs <100ms. Use a small draft model (7B) to propose 4-5 tokens per verification step. This reduces per-token decode latency (not time-to-first-token, which is dominated by prompt prefill) by 40-60%, at the cost of 2-3x total compute during generation.
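The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal illustration of the greedy variant, not a serving implementation: `toy_target`, `toy_draft`, and `speculative_decode` are hypothetical names, the "models" are plain functions from a token list to the next token, and the per-position verification calls below would be a single batched forward pass of the target model in a real system.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function: context tokens -> next token (greedy).
NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (sum(ctx) * 31 + 7) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    Agrees with the target except when the context length is a multiple of 5.
    """
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 5 == 0 else guess


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, verification steps)."""
    tokens = list(prompt)
    verify_steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. Target model checks every proposed position. Counted as one
        #    step because a real system scores all positions in one pass.
        verify_steps += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # first mismatch: keep the target's token and stop
        else:
            # All k proposals accepted: the same pass yields one bonus token.
            accepted.append(target(tokens + proposal))

        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], verify_steps


if __name__ == "__main__":
    out, steps = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20, k=4)
    print(f"{len(out) - 3} tokens generated in {steps} verification steps")
```

Because a proposed token is kept only when it matches what the target would have produced, the output is identical to plain greedy decoding with the target alone; the saving comes from needing fewer target passes than generated tokens.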
Journey Context:
Autoregressive decoding is memory-bandwidth bound: each generated token requires reading the full model weights from memory. Speculative decoding amortizes this cost. The draft model generates candidate tokens quickly (low memory traffic per token due to its small size), and the large model verifies all of them in a single parallel forward pass. The break-even is an acceptance rate >0.6 (typically achieved with a 7B draft + 70B target on similar distributions). Latency drops 40-60%, but total FLOPs increase 2-3x, raising compute costs. Use this only when latency is the binding constraint, not raw compute budget.
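The relationship between acceptance rate, draft length, and speedup can be estimated with the standard idealized model from the speculative decoding literature. The sketch below assumes each drafted token is accepted independently with probability `alpha`, and that one draft step costs a fraction `draft_cost_ratio` of one target step. The function names and the value 0.1 (a rough proxy for a 7B draft against a 70B target when both are memory-bandwidth bound) are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step.

    alpha: per-token acceptance probability (assumed independent).
    k:     number of tokens the draft model proposes per step.
    Closed form of 1 + alpha + alpha^2 + ... + alpha^k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    One step costs k draft passes plus one target pass, measured in units
    of a single target pass. Scheduling and sampling overheads are ignored.
    """
    step_cost = k * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    for alpha in (0.6, 0.8):
        s = estimated_speedup(alpha, k=4, draft_cost_ratio=0.1)
        print(f"alpha={alpha}: speedup {s:.2f}x, latency reduction {1 - 1 / s:.0%}")
```

Under these assumptions, an acceptance rate of 0.6 with 4 drafted tokens gives roughly a 1.6x speedup, and 0.8 gives roughly 2.4x, which is in line with the 40-60% latency range quoted above. The model leaves out verification and scheduling overhead, so measured gains on real hardware would likely be lower.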
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:59:21.713908+00:00— report_created — created