Report #25369
[tooling] llama.cpp speculative decoding fails silently or crashes with draft model
Use `-md /path/to/draft.gguf` (draft model) and verify that the main and draft models share identical tokenizer metadata (`tokenizer.ggml.model`, vocab size, and BPE merges). Do not pass the same model file to both `-m` and `-md`; use a smaller, dedicated draft model (e.g., 1B for 70B).
Journey Context:
Most guides demonstrate speculative decoding with the same model file as both draft and main, which defeats the purpose: the draft has to be much cheaper to run than the main model for any speedup. A real speedup requires a tiny draft model, but llama.cpp validates tokenizer compatibility strictly. If the vocab hashes differ, it falls back to non-speculative mode with no warning. Inspect the `gguf-dump` metadata of both GGUFs and confirm that the length of `tokenizer.ggml.tokens` and the `tokenizer.ggml.model` string match exactly.
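The manual `gguf-dump` comparison can also be scripted. Below is a minimal sketch, not part of llama.cpp, that reads the metadata header of two GGUF files using only the Python standard library and reports tokenizer differences. It assumes little-endian GGUF version 2 or 3 files and the header layout as I understand it from the GGUF specification. The function names (`read_gguf_metadata`, `tokenizer_mismatches`) are hypothetical, and an empty result does not guarantee that llama.cpp will accept the pair.

```python
import struct

# GGUF value-type codes mapped to struct formats for fixed-size scalars.
# Assumption: the file is little-endian (the common case).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")   # element type
        count = _read(f, "<Q")       # number of elements
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(path):
    """Return the key/value metadata of a GGUF file as a dict."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"{path}: not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError(f"{path}: GGUF v{version} is not handled here")
        _read(f, "<Q")               # tensor count, unused
        kv_count = _read(f, "<Q")
        meta = {}
        for _ in range(kv_count):
            key = _read_string(f)
            meta[key] = _read_value(f, _read(f, "<I"))
        return meta


def tokenizer_mismatches(main_path, draft_path):
    """List human-readable tokenizer differences; empty means none found."""
    main = read_gguf_metadata(main_path)
    draft = read_gguf_metadata(draft_path)
    problems = []
    key = "tokenizer.ggml.model"
    if main.get(key) != draft.get(key):
        problems.append(f"{key}: {main.get(key)!r} vs {draft.get(key)!r}")
    main_tokens = main.get("tokenizer.ggml.tokens", [])
    draft_tokens = draft.get("tokenizer.ggml.tokens", [])
    if len(main_tokens) != len(draft_tokens):
        problems.append(
            f"vocab size: {len(main_tokens)} vs {len(draft_tokens)}")
    elif main_tokens != draft_tokens:
        problems.append("tokenizer.ggml.tokens: same length, different entries")
    if main.get("tokenizer.ggml.merges") != draft.get("tokenizer.ggml.merges"):
        problems.append("tokenizer.ggml.merges differ")
    return problems
```

Calling `tokenizer_mismatches("/path/to/main.gguf", "/path/to/draft.gguf")` before launching with `-m` and `-md` returns a list of differences; if the list is non-empty, the draft model is likely to be rejected or ignored. Parsing a large vocabulary this way is slow compared with `gguf-dump`, but it avoids any extra dependency.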
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T20:59:00.366118+00:00— report_created — created