Report #16881
[tooling] llama-server speculative decoding fails with tokenization mismatch or garbage output despite correct model architecture
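Ensure the draft model GGUF has exactly the same `tokenizer.ggml.pre`, `tokenizer.ggml.model`, and `tokenizer.ggml.tokens` metadata as the target model (and, for BPE vocabularies, the same `tokenizer.ggml.merges`). One way to check is to dump the tokenizer metadata from both files and compare checksums, for example with `md5sum`. If you use the llama.cpp built-in server, also load both models onto the same device with matching GPU-offload settings (`-ngl` for the target, `-ngld` for the draft); tokenization itself runs on the CPU, so treat this as a secondary check rather than the main fix.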
Journey Context:
Speculative decoding requires the draft and target models to map every string to the same token IDs. Even if both models are labeled 'Llama-3', a different BPE pre-tokenizer or merge table causes the IDs to drift. The server does not verify this automatically; it either crashes or produces garbage output. Users often pick a 'smaller 7B' draft without checking where its tokenizer came from. The fix is to verify the GGUF tokenizer metadata manually, or to use a draft model converted from the same tokenizer source as the target.
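The manual verification described above can be sketched in Python. This is a minimal, standard-library-only reader, not part of llama.cpp: it assumes a little-endian GGUF file of version 2 or later, and the function names `tokenizer_fingerprint` and `tokenizer_mismatches` are hypothetical helpers introduced for illustration. It hashes every `tokenizer.ggml.*` metadata value in each file and reports the keys whose hashes differ.

```python
import hashlib
import struct

# GGUF scalar value-type IDs mapped to struct formats (little-endian assumed).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one scalar with the given struct format."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by raw bytes."""
    return f.read(_read(f, "<Q"))


def _read_value(f, vtype):
    """Read one metadata value of the given GGUF value type."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def tokenizer_fingerprint(path):
    """Return {key: sha256 hex digest} for each tokenizer.ggml.* key."""
    digests = {}
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file")
        if _read(f, "<I") < 2:
            raise ValueError("only GGUF version 2 or later is handled here")
        _read(f, "<Q")  # tensor count, not needed for this check
        for _ in range(_read(f, "<Q")):  # metadata key/value count
            key = _read_string(f).decode("utf-8")
            value = _read_value(f, _read(f, "<I"))
            if key.startswith("tokenizer.ggml."):
                digests[key] = hashlib.sha256(repr(value).encode()).hexdigest()
    return digests


def tokenizer_mismatches(target_path, draft_path):
    """Return the sorted tokenizer keys that differ between two GGUF files."""
    target = tokenizer_fingerprint(target_path)
    draft = tokenizer_fingerprint(draft_path)
    return sorted(k for k in target.keys() | draft.keys()
                  if target.get(k) != draft.get(k))
```

Calling `tokenizer_mismatches("target.gguf", "draft.gguf")` (placeholder paths) returns an empty list when the tokenizer metadata agrees. Any key it returns, such as `tokenizer.ggml.tokens` or `tokenizer.ggml.pre`, is a candidate cause of the ID drift described above. Keys present in only one file are also reported as mismatches.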
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T03:52:45.026988+00:00— report_created — created