Report #98833
[tooling] llama.cpp speculative decoding crashes or emits token garbage
Use a draft model that shares the exact tokenizer/vocabulary with the target and pass it with `--model-draft draft.gguf`. Aim for a draft that is roughly 10x cheaper to run than the target and achieves an acceptance rate above 60%; a 1.5B–3B draft for a 7B–70B target is a common starting point.
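A minimal sketch of what "shares the exact vocabulary" means in practice. It assumes you have already extracted each model's token list (list index = token ID) by whatever means you prefer; the function name `vocab_mismatches` and the toy vocabularies are hypothetical and not part of llama.cpp.

```python
from itertools import zip_longest


def vocab_mismatches(target_tokens, draft_tokens, limit=10):
    """Return up to `limit` (token_id, target_piece, draft_piece) triples that differ.

    A missing entry (one vocabulary is shorter) shows up as None.
    """
    diffs = []
    for token_id, (t, d) in enumerate(zip_longest(target_tokens, draft_tokens)):
        if t != d:
            diffs.append((token_id, t, d))
            if len(diffs) >= limit:
                break
    return diffs


# Hypothetical toy vocabularies, for illustration only.
target = ["<s>", "</s>", "hello", "world"]
draft_ok = ["<s>", "</s>", "hello", "world"]
draft_bad = ["<s>", "</s>", "world", "hello", "!"]

print(vocab_mismatches(target, draft_ok))   # empty list: safe to pair
print(vocab_mismatches(target, draft_bad))  # IDs 2, 3 and 4 disagree
```

Any non-empty result means the same token ID decodes to different text in the two models, which is the condition that leads to garbage output.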
Journey Context:
Speculative decoding verifies K draft tokens in one target forward pass, so net speedup is roughly `expected_tokens_per_target_pass / (1 + K × draft_cost / target_cost)`, where the expected token count rises with the acceptance rate. If the tokenizers mismatch, token IDs from the draft map to different strings in the target's vocabulary and the verification step produces garbage. The most common mistake is grabbing any small GGUF without checking that its tokenizer matches the target's. A draft that is too large can slow you down, because the forward passes spent on rejected tokens dominate. Always check the server logs or a benchmark for the actual acceptance rate before declaring victory.
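The speedup reasoning above can be sketched numerically. This is a rough estimate, not a measurement. It assumes each draft token is accepted independently with probability `alpha`, and that verifying K tokens costs about the same as one ordinary target pass. The function names and sample numbers are illustrative assumptions.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass when k tokens are drafted.

    Assumes each draft token is accepted independently with probability alpha;
    the target always contributes one token itself, hence the k + 1 ceiling.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio = (cost of one draft forward pass) / (cost of one target pass).
    Each round costs k draft passes plus one target verification pass.
    """
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * cost_ratio)


if __name__ == "__main__":
    # Hypothetical settings: 60% acceptance, 4 drafted tokens per round.
    for ratio in (0.1, 0.5):
        print(f"cost_ratio={ratio}: {estimated_speedup(0.6, 4, ratio):.2f}x")
```

Under these assumptions, a draft at one tenth of the target's cost comes out above 1x, while a draft at half the target's cost drops below 1x. That is the "too-large draft" slowdown described above.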
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T04:51:16.202759+00:00— report_created — created