Report #42854
[tooling] Speculative decoding failing or not accelerating generation when using llama.cpp server with a draft model
Ensure the draft and target models share the same vocabulary/tokenizer and come from the same architecture family (e.g., both Llama-3 based), then launch the server with `-md draft-model.gguf` alongside the main model. Verify with `--verbose` that the acceptance rate is above 0.5; mismatched tokenizers cause every draft token to be rejected immediately.
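The launch step above can be sketched as a small Python helper that assembles the command line. This is only an illustration: the binary name `llama-server` and both `.gguf` file names are assumptions (older builds name the binary differently), and the helper `build_server_command` is hypothetical. Only the `-m`, `-md`, and `--verbose` flags come from llama.cpp itself.

```python
import shlex


def build_server_command(target_path, draft_path, verbose=True):
    """Assemble an argv list pairing a target model with a draft model.

    Assumption: the server binary is on PATH as `llama-server`.
    """
    argv = ["llama-server", "-m", target_path, "-md", draft_path]
    if verbose:
        # Verbose logging is what exposes the draft acceptance statistics.
        argv.append("--verbose")
    return argv


# Hypothetical file names; substitute your own same-family model pair.
cmd = build_server_command("target-model.gguf", "draft-model.gguf")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a model path contains spaces; the list can be passed directly to `subprocess.run`.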
Journey Context:
Speculative decoding uses a small draft model to propose tokens, which the large target model then verifies in parallel. Many users try to use any small model as the draft (e.g., Phi-3 mini drafting for Llama-3 70B), but if the tokenizers differ, the same token IDs map to different strings, so the acceptance rate drops to 0% and there is no speedup (sometimes a slowdown). The draft must come from the same family as the target and have an identical vocabulary. The server must also be built with speculative decoding support. A common mistake is passing `-md` a draft that is quantized too aggressively (e.g., Q2_K), which degrades draft quality; aim for Q4_K_M or higher for draft models. The draft's context window must also be large enough for the speculative lookahead (default is 5-8 tokens). Check the logs for 'draft acceptance rate' to debug.
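To see why a tokenizer mismatch leads to a 0% acceptance rate, here is a toy sketch of greedy draft verification. It is a simplification, not llama.cpp's implementation: the two vocabularies are invented, and the functions `count_accepted` and `acceptance_rate` are hypothetical helpers. Draft tokens are accepted one by one until the first ID that disagrees with the target.

```python
def count_accepted(draft_ids, target_ids):
    """Count draft tokens accepted before the first disagreement."""
    accepted = 0
    for draft_id, target_id in zip(draft_ids, target_ids):
        if draft_id != target_id:
            break  # every token after the first mismatch is discarded
        accepted += 1
    return accepted


def acceptance_rate(draft_ids, target_ids):
    """Fraction of proposed draft tokens that the target accepted."""
    if not draft_ids:
        return 0.0
    return count_accepted(draft_ids, target_ids) / len(draft_ids)


# Invented toy vocabularies: same strings, different ID assignments.
target_vocab = {"The": 0, " cat": 1, " sat": 2, " down": 3}
same_family_vocab = dict(target_vocab)
other_family_vocab = {"The": 7, " cat": 4, " sat": 9, " down": 5}

pieces = ["The", " cat", " sat", " down"]
target_ids = [target_vocab[p] for p in pieces]
good_draft_ids = [same_family_vocab[p] for p in pieces]
bad_draft_ids = [other_family_vocab[p] for p in pieces]

print(acceptance_rate(good_draft_ids, target_ids))  # 1.0
print(acceptance_rate(bad_draft_ids, target_ids))   # 0.0
```

Even though both drafts "mean" the same text, the mismatched draft is rejected at the first token because verification compares IDs, not strings. Each step then costs a draft pass plus a target pass while yielding no extra tokens, which is where the slowdown comes from.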
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:23:50.357871+00:00— report_created — created