Report #94737
[tooling] Speculative decoding in llama.cpp showing zero speedup or token rejection errors with draft model
Verify that the draft and target models share the same tokenizer: compare the vocabulary size and the BOS/EOS token IDs that llama.cpp prints when it loads each model. Use --draft 16 --draft-min 1 for batch=1; for batch>1, increase to --draft 64+ to amortize overhead, or avoid speculative decoding entirely at high batch sizes.
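The tokenizer check can be sketched as a small comparison script. This is only an illustration: the TokenizerInfo class and tokenizer_mismatches function are hypothetical helpers, not part of llama.cpp, and the token IDs are sample values that should be replaced with the ones read from your own model load logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerInfo:
    """Tokenizer facts copied by hand from a model's load log (hypothetical helper)."""
    name: str
    vocab_size: int
    bos_id: int
    eos_id: int


def tokenizer_mismatches(draft: TokenizerInfo, target: TokenizerInfo) -> list[str]:
    """Return one description per field that differs between draft and target."""
    problems = []
    for field in ("vocab_size", "bos_id", "eos_id"):
        d, t = getattr(draft, field), getattr(target, field)
        if d != t:
            problems.append(f"{field}: draft={d} target={t}")
    return problems


# Sample values for illustration only; substitute the IDs from your own logs.
draft = TokenizerInfo("tinyllama-1b", vocab_size=32000, bos_id=1, eos_id=2)
target = TokenizerInfo("llama-3-70b", vocab_size=128256, bos_id=128000, eos_id=128009)

for problem in tokenizer_mismatches(draft, target):
    print("MISMATCH", problem)
```

An empty result means the fields compared here agree; it does not prove that every token in the two vocabularies maps to the same ID.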
Journey Context:
Users try to accelerate 70B models by drafting with TinyLlama-1B or Qwen-0.5B but see no speedup. The hidden requirement is that draft and target must have identical vocabularies and special token IDs (BOS, EOS, EOT). If TinyLlama uses ID 2 for EOS but the target uses ID 128009 (Llama-3), the draft tokens are rejected 100% of the time, so drafting adds overhead with zero benefit. Additionally, speculative decoding has a fixed overhead per batch item. At batch=1, drafting 16 tokens ahead works. At batch=4, the overhead quadruples while the benefit from drafting does not scale linearly, so hiding the latency requires much larger draft windows (64+), which in turn risks higher rejection rates. Many users run batch>1 in production and wonder why --draft slows things down.
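A rough way to see why 100% rejection turns drafting into a slowdown is the standard expected-acceptance estimate for speculative decoding. The sketch below is not llama.cpp code and makes simplifying assumptions: each draft token is accepted independently with the same probability, verifying a whole draft window costs about one target-model pass (reasonable only at batch=1), and draft_cost is a made-up ratio of draft-token time to target-pass time. It does not model the extra per-item overhead at larger batch sizes.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target pass, assuming independent acceptance."""
    if accept_rate >= 1.0:
        # Every draft token is accepted, plus the one token the target adds.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is an assumed cost ratio per draft token."""
    # One target pass (cost 1.0) plus draft_len draft-model tokens per step.
    step_cost = 1.0 + draft_len * draft_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


# Mismatched tokenizers: nothing is accepted, so drafting is pure overhead.
print(estimated_speedup(accept_rate=0.0, draft_len=16, draft_cost=0.05))

# Well-matched draft model: most tokens are accepted.
print(estimated_speedup(accept_rate=0.8, draft_len=16, draft_cost=0.05))
```

Under these assumptions a value below 1.0 means speculative decoding is slower than plain decoding, which matches the zero-acceptance case described above.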
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T17:36:01.493441+00:00— report_created — created