Report #56408
[tooling] Speculative decoding in llama.cpp providing no speedup or gibberish output
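Ensure the draft model uses the same tokenizer as the target (check vocab size and merges), start from --draft 16 --draft-min 8 --draft-max 32, and verify the draft is under 10% of the target's size; mismatched tokenizers cause 0% acceptance. Flag names differ between llama.cpp builds, and in recent builds --draft and --draft-max are aliases for the same setting, so confirm against --help before running.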
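The size rule above can be checked before launching anything. This is a minimal sketch, not part of llama.cpp: the function names and the 0.10 default are illustrative, and file size is only a rough proxy for model size because it also depends on quantization.

```python
import os


def draft_size_ratio(draft_bytes: int, target_bytes: int) -> float:
    """Return the draft model size as a fraction of the target model size."""
    if target_bytes <= 0:
        raise ValueError("target size must be positive")
    return draft_bytes / target_bytes


def draft_is_small_enough(draft_bytes: int, target_bytes: int,
                          max_ratio: float = 0.10) -> bool:
    """True when the draft is strictly below max_ratio of the target size.

    max_ratio=0.10 mirrors the "<10% target size" guideline in this report;
    it is a rule of thumb, not a limit enforced by llama.cpp.
    """
    return draft_size_ratio(draft_bytes, target_bytes) < max_ratio


def file_size_ratio(draft_path: str, target_path: str) -> float:
    """Same ratio, computed from on-disk file sizes (hypothetical paths)."""
    return draft_size_ratio(os.path.getsize(draft_path),
                            os.path.getsize(target_path))
```

For example, `draft_is_small_enough(os.path.getsize("draft.gguf"), os.path.getsize("target.gguf"))` would flag an oversized draft, where both file names are placeholders.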
Journey Context:
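llama.cpp requires strict compatibility between draft and target: the same tokenizer (vocab, merges, special tokens) and a similar architecture. Common error: drafting Llama-2 (SPM tokenizer) with a model that uses a GPT-2-style BPE tokenizer yields 0% acceptance. (TinyLlama itself uses the Llama-2 tokenizer, so it is not an instance of this mismatch.) --draft-min prevents waste on short sequences; --draft 16 balances memory and speed. With an incompatible pair, every drafted token is rejected, so you pay the draft model's GPU overhead and get no speedup. This differs from standard batched inference, which involves no second model that has to agree with the target.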
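The tokenizer requirement can be illustrated with a small comparison of two token lists indexed by token id. This is a sketch under assumptions: how the lists are extracted (for example from each model's tokenizer metadata) is left out, the function names are hypothetical, and it is stricter than whatever check llama.cpp applies internally.

```python
def compare_vocabs(draft_tokens, target_tokens):
    """Compare two vocabularies given as lists where index == token id."""
    shared = min(len(draft_tokens), len(target_tokens))
    # Token ids whose text differs between the two models.
    mismatched = [i for i in range(shared)
                  if draft_tokens[i] != target_tokens[i]]
    same_size = len(draft_tokens) == len(target_tokens)
    return {
        "draft_size": len(draft_tokens),
        "target_size": len(target_tokens),
        "size_diff": abs(len(draft_tokens) - len(target_tokens)),
        "mismatched_ids": mismatched,
        # Strict rule assumed here: identical size and identical tokens.
        "compatible": same_size and not mismatched,
    }


def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens the target accepted (0.0 if none drafted)."""
    return accepted / drafted if drafted else 0.0
```

A result with a non-empty `mismatched_ids` list, or an `acceptance_rate` near zero computed from the accepted and drafted counts of a run, points to the tokenizer mismatch described above rather than to a tuning problem with the draft flags.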
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:10:27.110179+00:00— report_created — created