Report #21496
[tooling] Speculative decoding in llama.cpp gives minimal speedup or slows down generation
Run `llama-server -m <target.gguf> --model-draft <draft.gguf> --draft 5` (check `llama-server --help` for the exact flag names in your build) and ensure both models share the exact `tokenizer.ggml.pre` value and vocabulary. Mismatched tokenizers cause verification failures that kill performance.
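As a rough illustration of the tokenizer check, the sketch below compares the tokenizer-related GGUF metadata keys of two models. It assumes the metadata has already been loaded into plain dictionaries (for example with a GGUF reader of your choice); the helper name `tokenizer_mismatches` and the sample values are hypothetical and not part of llama.cpp.

```python
from typing import Any, Mapping

# GGUF metadata keys that describe the tokenizer.
TOKENIZER_KEYS = (
    "tokenizer.ggml.model",
    "tokenizer.ggml.pre",
    "tokenizer.ggml.tokens",
)


def tokenizer_mismatches(
    target: Mapping[str, Any], draft: Mapping[str, Any]
) -> list[str]:
    """Return the tokenizer keys whose values differ between two metadata dicts.

    A key that is absent from both dicts is treated as matching.
    """
    return [key for key in TOKENIZER_KEYS if target.get(key) != draft.get(key)]


if __name__ == "__main__":
    # Hypothetical metadata standing in for values read from two GGUF files.
    target_meta = {
        "tokenizer.ggml.model": "gpt2",
        "tokenizer.ggml.pre": "llama-bpe",
        "tokenizer.ggml.tokens": ["<s>", "hello", "world"],
    }
    draft_meta = {
        "tokenizer.ggml.model": "llama",
        "tokenizer.ggml.tokens": ["<s>", "hel", "lo"],
    }
    problems = tokenizer_mismatches(target_meta, draft_meta)
    if problems:
        print("Draft and target tokenizers differ on:", ", ".join(problems))
    else:
        print("Tokenizer metadata matches.")
```

An empty result only says the compared keys agree; it does not prove the draft model is a good predictor of the target.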
Journey Context:
Speculative decoding uses a small draft model to propose the next N tokens, which the large target model then verifies in a single batched forward pass. When the draft closely matches the target's distribution, most of those tokens are accepted, so each target pass yields several tokens instead of one. The llama.cpp server supports this via `--model-draft` (`-md`). The common failure mode is using a draft model with a different tokenizer (e.g., Llama-2 7B drafting for Llama-3 70B). When tokenizers differ, the draft's token IDs do not mean the same thing to the target, so the acceptance rate collapses toward 0% and drafting adds overhead with no benefit. The fix is to verify that the `tokenizer.ggml.pre` metadata and vocabulary match, or to use a draft and target from the same model family. Additionally, set `--draft 5` (or a value in the 4-8 range) based on draft speed; a value that is too high (e.g., 15) spends draft compute and KV-cache work on tokens that are discarded after the first rejection. The server also supports continuous batching together with speculative decoding.
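To see why a low acceptance rate or an oversized draft length hurts, here is a minimal sketch of the usual simplified model from the speculative decoding literature. It assumes each drafted token is accepted independently with probability `alpha`, that verifying the whole draft costs about one target forward pass, and that one draft pass costs `cost_ratio` times one target pass. The function names and example numbers are illustrative assumptions, not measurements from llama.cpp.

```python
def expected_tokens_per_pass(alpha: float, n_draft: int) -> float:
    """Expected tokens produced per target pass for acceptance rate alpha.

    The target always contributes one token, so the result ranges from 1
    (nothing accepted) to n_draft + 1 (everything accepted).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the simplified model.

    Each step costs n_draft draft passes plus one target pass, measured in
    units of one target pass.
    """
    step_cost = n_draft * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, n_draft) / step_cost


if __name__ == "__main__":
    # Illustrative settings: a draft model 10x cheaper than the target.
    for alpha, n_draft in [(0.0, 5), (0.8, 5), (0.8, 15)]:
        speedup = estimated_speedup(alpha, n_draft, cost_ratio=0.1)
        print(f"alpha={alpha:.1f} n_draft={n_draft:2d} speedup={speedup:.2f}x")
```

Under these assumptions, an acceptance rate of zero gives a speedup below 1 (a slowdown), and raising the draft length from 5 to 15 at the same acceptance rate lowers the estimate, which matches the advice to keep `--draft` moderate.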
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:29:44.791968+00:00— report_created — created