Report #7837
[tooling] Speculative decoding (--draft) shows no speedup or is slower than base generation
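Verify that the draft and target models share exactly the same tokenizer vocabulary (compare the SHA256 of each model's tokenizer.json or merges file, or pick a draft from the same model family). Keep the draft:target parameter ratio between 1:4 and 1:10 (e.g., a 7B draft for a 70B target). For code and math tasks, raise the draft length to 16 to 24 tokens instead of the default 4, so that longer drafts can exploit the higher acceptance rate on such text. Note that in llama.cpp the draft length is set with --draft, while -td (--threads-draft) sets the draft thread count, so confirm the flag against your tool's help output.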
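The checks above can be sketched as a small pre-flight script. This is a minimal sketch, not part of any tool: the function names are hypothetical, the directory layout (one folder per model containing tokenizer.json and config.json) is an assumption, and the `model.vocab` / `model.merges` and `rope_theta` / `rope_scaling` keys assume Hugging Face style files for a BPE model. A byte-identical hash is stricter than necessary, so a content-level comparison is included as a fallback.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tokenizers_byte_identical(draft_dir, target_dir, filename="tokenizer.json"):
    """True when both model directories ship a byte-identical tokenizer file."""
    return sha256_of(Path(draft_dir) / filename) == sha256_of(Path(target_dir) / filename)


def vocab_and_merges_match(draft_dir, target_dir, filename="tokenizer.json"):
    """Compare vocab and merges by content, ignoring JSON formatting differences.

    Assumes the Hugging Face BPE layout: {"model": {"vocab": ..., "merges": ...}}.
    """
    def load(model_dir):
        data = json.loads((Path(model_dir) / filename).read_text(encoding="utf-8"))
        model = data.get("model", {})
        return model.get("vocab"), model.get("merges")

    return load(draft_dir) == load(target_dir)


def rope_settings_match(draft_dir, target_dir, filename="config.json"):
    """True when rope_theta and rope_scaling agree (assumed config.json keys)."""
    def load(model_dir):
        cfg = json.loads((Path(model_dir) / filename).read_text(encoding="utf-8"))
        return cfg.get("rope_theta"), cfg.get("rope_scaling")

    return load(draft_dir) == load(target_dir)


def ratio_in_range(draft_params_b, target_params_b, low=4.0, high=10.0):
    """True when target/draft size falls inside the report's 1:4 to 1:10 band."""
    ratio = target_params_b / draft_params_b
    return low <= ratio <= high
```

For example, `ratio_in_range(7, 70)` accepts a 7B draft for a 70B target, while `ratio_in_range(0.5, 70)` rejects a 0.5B draft. If the hash check fails but `vocab_and_merges_match` passes, the files probably differ only in formatting or metadata.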
Journey Context:
Users often grab any small model (e.g., TinyLlama) as the draft for Llama-3 70B without realizing that the tokenizers must match exactly; otherwise the draft's token IDs do not map onto the target's logits. Mismatched BPE merges cause draft tokens to be rejected immediately, which adds overhead with zero benefit. The 1:4 to 1:10 ratio is a heuristic attributed to optimal compute allocation studies; drafts at roughly 1:100 or smaller (e.g., 0.5B for 70B) have acceptance rates below 20%, which negates the gains. Code benefits from longer drafts (16-24 tokens) because of its repetitive syntax patterns, while creative writing needs shorter ones (4-8). Most implementations default to 4-8, which is suboptimal for code. One critical implementation detail: the draft model must use the same RoPE theta and context scaling as the target, because mismatched positional embeddings cause the two models to diverge even when their vocabularies are identical.
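To see why a low acceptance rate erases the gain, and why longer drafts only pay off when acceptance is high, the standard analysis of speculative decoding can be sketched in a few lines. It assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times one target pass. The 0.1 cost ratio below is an illustrative assumption, not a measurement, and the function names are hypothetical.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target verification pass.

    Uses the geometric-series result (1 - alpha**(gamma + 1)) / (1 - alpha),
    which assumes independent, identically distributed acceptances.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-clock speedup over plain target-only decoding.

    Each step costs gamma draft passes plus one target pass, measured in
    units of one target pass.
    """
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Illustrative sweep; cost_ratio=0.1 is an assumed value.
    for alpha in (0.2, 0.6, 0.8, 0.95):
        for gamma in (4, 8, 16, 24):
            print(f"alpha={alpha:.2f} gamma={gamma:2d} "
                  f"speedup={expected_speedup(alpha, gamma, 0.1):.2f}")
```

Under these assumptions, a speedup below 1.0 means speculative decoding is slower than base generation, which is the symptom in this report. The sweep also suggests that raising the draft length helps only when acceptance is already high, so it is worth measuring the acceptance rate before increasing it.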
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T03:48:28.947418+00:00— report_created — created