Report #55160
[tooling] Speculative decoding with llama.cpp produces no speedup or nonsense output despite correct setup
Verify that the draft and target models use an identical tokenizer vocabulary (check `tokenizer.ggml.pre` in the GGUF metadata). Launch with `--draft 16 --draft-min 8`, using a Q4_0 7B draft for 70B targets, and inspect the `--verbose-prompt` logs to confirm that the draft acceptance rate stays above 0.6.
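One way to run the tokenizer check without extra tooling is to read the GGUF metadata directly. The sketch below is a minimal stdlib-only reader, assuming little-endian GGUF v2 or v3 files. The helper names (`read_gguf_metadata`, `tokenizer_mismatches`) are hypothetical, and comparing `tokenizer.ggml.model` and `tokenizer.ggml.tokens` alongside `tokenizer.ggml.pre` is an added assumption about what "identical vocabulary" should cover.

```python
import struct

# GGUF scalar value type IDs mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from the stream."""
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return all metadata key/value pairs from an open binary GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 uses 32-bit lengths; not handled here")
    _read(f, "<Q")  # tensor count, unused for this check
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "<I")
        meta[key] = _read_value(f, vtype)
    return meta


# Assumption: these three keys together describe the tokenizer well enough.
TOKENIZER_KEYS = ("tokenizer.ggml.model", "tokenizer.ggml.pre",
                  "tokenizer.ggml.tokens")


def tokenizer_mismatches(target_meta, draft_meta):
    """List the tokenizer keys whose values differ between the two models."""
    return [k for k in TOKENIZER_KEYS
            if target_meta.get(k) != draft_meta.get(k)]
```

Open both files in binary mode, pass each to `read_gguf_metadata`, and treat any non-empty result from `tokenizer_mismatches` as a reason to pick a different draft model. A key that is missing on only one side (for example, no `tokenizer.ggml.pre`) also counts as a mismatch here.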
Journey Context:
Users commonly grab any small 7B model as a draft without checking tokenizer compatibility. The Llama-2 and Llama-3 tokenizers are incompatible, so acceptance silently drops to 0% and none of the draft's tokens are used. In addition, the default draft size (8) is too small for 70B models; raising it to 16 with a minimum of 8 reduces the overhead of launching the draft forward pass. The fix is to verify the tokenizers explicitly and to tune the draft size from the acceptance ratios observed in the verbose logs, rather than assuming the default parameters work for every model pair.
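To make the acceptance-based tuning concrete, the following sketch computes an acceptance rate from raw counts and estimates how many tokens each target forward pass yields. It relies on the common simplification that every drafted token is accepted independently with the same probability, which real model pairs only approximate. The function names are illustrative, and the counts must come from your own logs.

```python
def acceptance_rate(n_accepted, n_drafted):
    """Fraction of drafted tokens that the target model accepted."""
    return n_accepted / n_drafted if n_drafted else 0.0


def expected_tokens_per_target_pass(alpha, draft_len):
    """Expected tokens emitted per target forward pass.

    Assumes each drafted token is accepted independently with probability
    alpha; the result is the geometric sum 1 + alpha + ... + alpha**draft_len.
    """
    if alpha >= 1.0:
        return draft_len + 1
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


def meets_threshold(n_accepted, n_drafted, threshold=0.6):
    """Check the report's rule of thumb: acceptance should stay above 0.6."""
    return acceptance_rate(n_accepted, n_drafted) > threshold
```

Comparing `expected_tokens_per_target_pass(rate, 8)` with `expected_tokens_per_target_pass(rate, 16)` at the measured rate gives a rough sense of whether a longer draft can pay for its extra draft-model passes.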
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:04:49.026214+00:00— report_created — created