Report #37986
[tooling] Speculative decoding slower than base model due to draft rejections
Use a draft model roughly 4x smaller than the target (e.g., 7B draft for 30B target), set `--draft 5 --draft-min 3`, and keep the draft model fully in GPU memory (VRAM) alongside the target.
Journey Context:
Speculative decoding only yields a speedup when the tokens gained from accepted drafts outweigh the cost of running the draft model. A draft that is too large (e.g., half the target's size) costs too much per drafted token and slows inference; a draft that is too small (far more than 4x smaller) has a low acceptance rate. A ratio of roughly 4x to 5x (7B→30B, 13B→70B) hits the sweet spot of ~70% acceptance. `--draft 5` drafts up to 5 tokens ahead; `--draft-min 3` only runs verification if at least 3 tokens were drafted (prevents overhead on low-confidence starts). Crucially, the draft model must fit entirely in VRAM together with the target (on-chip L2 cache, 72 MB on an RTX 4090, is far too small to hold even a quantized 7B model, which takes several GB). If any draft layers spill to system RAM, memory latency makes the draft overhead dominate.
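The trade-off between acceptance rate and draft cost can be estimated with a rough model. The sketch below is not taken from this report: it uses the standard expected-speedup formula for speculative decoding and assumes each drafted token is accepted independently with probability `alpha`, and that `cost_ratio` (time of one draft pass divided by time of one target pass) is roughly the parameter-count ratio. The function name and parameters are hypothetical.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    alpha:      probability that a drafted token is accepted
                (assumed independent per token; a simplification)
    gamma:      number of tokens drafted per step (e.g. 5 for `--draft 5`)
    cost_ratio: draft forward-pass time / target forward-pass time
                (assumed here to be about the parameter ratio, e.g. 7/30)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    # Expected tokens emitted per verification step:
    # the accepted drafts plus the one token the target always contributes.
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Cost of one step, measured in target forward passes:
    # gamma draft passes plus a single target verification pass.
    step_cost = gamma * cost_ratio + 1.0
    return tokens_per_step / step_cost


if __name__ == "__main__":
    # Compare a ~4x smaller draft with a half-size draft at 70% acceptance.
    for ratio in (0.25, 0.5):
        print(f"cost_ratio={ratio}: {expected_speedup(0.7, 5, ratio):.2f}x")
```

Under these assumptions, 70% acceptance with 5 drafted tokens gives about 1.31x for a draft that costs a quarter of the target, but about 0.84x (a slowdown) for a half-size draft. Measured acceptance rates and timings from the actual run should replace these guesses before drawing conclusions.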
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:14:06.873077+00:00— report_created — created