Report #49433
[tooling] llama.cpp speculative decoding slower than base model
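Pass a small Q4_0 draft model (e.g., 1B params) from the same family as the target via `--model-draft` (`-md`); confirm that both models use the same tokenizer by comparing their `tokenizer.ggml.pre` metadata and vocabulary; set `--draft 5` to propose 5 draft tokens per step. Flag names differ between llama.cpp versions, so check `--help` for your build.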
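A minimal sketch of the tokenizer check, assuming the metadata of each GGUF file has already been dumped into a plain Python dict with list and string values (the loading step is omitted here). The key names follow the GGUF tokenizer metadata convention; the function name `tokenizer_mismatches` is hypothetical.

```python
# Keys compared between the target and draft model metadata.
TOKENIZER_KEYS = (
    "tokenizer.ggml.model",   # tokenizer type, e.g. "llama" or "gpt2"
    "tokenizer.ggml.pre",     # pre-tokenizer name (BPE models)
    "tokenizer.ggml.tokens",  # vocabulary, in token-id order
    "tokenizer.ggml.merges",  # BPE merge rules, in priority order
)


def tokenizer_mismatches(target_meta: dict, draft_meta: dict) -> list[str]:
    """Return the tokenizer keys whose values differ between the two models.

    A key missing from both dicts counts as a match (for example, no merges
    on a SentencePiece model); a key missing from only one is a mismatch.
    """
    return [
        key
        for key in TOKENIZER_KEYS
        if target_meta.get(key) != draft_meta.get(key)
    ]
```

An empty result suggests the pair is worth trying. Any listed key is a reason to pick a different draft model before benchmarking.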
Journey Context:
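Speculative decoding promises a 2x speedup but often fails because users pair a 70B target with a mismatched 7B draft model from a different family (e.g., Llama-2 with Mistral). The draft model must share the exact tokenizer vocabulary and merge rules; otherwise, the token acceptance rate collapses to <20%, and the time spent drafting and rejecting tokens makes generation slower than running the target alone. The correct pattern is to use a tiny 'baby' model from the same release family (e.g., TinyLlama for Llama-2, or Qwen-0.5B for Qwen-72B), quantized aggressively to Q4_0. Note that Llama-3 uses a different tokenizer from Llama-2, so TinyLlama is not a match for Llama-3 targets. The `--draft 5` flag sets the lookahead; higher values help on predictable output such as code but hurt on creative text, where fewer drafted tokens are accepted. When acceptance is high, the target verifies several drafted tokens in a single forward pass, so its weights are read from memory once per batch of tokens rather than once per token. This amortizes the memory-bandwidth cost that dominates single-token decoding.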
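To illustrate why a low acceptance rate ends up slower than the base model, here is a rough back-of-the-envelope estimate. It assumes that each drafted token is accepted independently with probability `alpha`, that one draft-model step costs a fraction `cost_ratio` of one target-model step, and that verifying all drafted tokens costs one target step. These are simplifying assumptions, not measurements from this report, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, n_draft: int) -> float:
    """Expected tokens produced per target verification pass.

    With independent acceptance probability alpha this is the geometric
    sum 1 + alpha + ... + alpha**n_draft.
    """
    if alpha >= 1.0:
        return float(n_draft + 1)
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding; below 1.0 means slower."""
    # Cost of one round, in units of a single target-model step:
    # n_draft draft steps plus one target verification pass.
    cost_per_pass = n_draft * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, n_draft) / cost_per_pass


if __name__ == "__main__":
    # Mismatched 7B draft for a 70B target: low acceptance, costly draft.
    print(round(estimated_speedup(alpha=0.2, n_draft=5, cost_ratio=0.1), 2))
    # Same-family tiny draft: high acceptance, cheap draft.
    print(round(estimated_speedup(alpha=0.8, n_draft=5, cost_ratio=0.02), 2))
```

Under these assumptions the mismatched pair comes out at roughly 0.83x (slower than the base model), while the same-family pair comes out at roughly 3.35x. Real numbers depend on hardware, batch size, and sampling settings, so treat this only as a way to reason about the trade-off.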
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:27:24.172137+00:00— report_created — created