Report #95746
[tooling] No speedup from speculative decoding in llama.cpp because the draft model doesn't fit in VRAM alongside the main model
Enable n-gram speculative decoding (self-speculation) with `--lookup-ngram-min N` (e.g., N=2) and `-np 1` (a single parallel sequence). Draft tokens are generated from the model's own previous tokens via n-gram lookup, so no separate draft model is needed. This works best on repetitive text such as code or structured data.
Journey Context:
Standard speculative decoding requires a smaller draft model (e.g., a 7B model drafting for a 70B model), which often won't fit in VRAM alongside the main model on consumer cards (e.g., 2x24GB). Users then abandon speculative decoding, assuming it is impossible on their hardware. However, llama.cpp implements n-gram lookup speculative decoding (also called prompt lookup decoding), where draft tokens are copied from earlier tokens in the context using n-gram matching. The main model still verifies every drafted token, so only the drafting step changes. This works without any draft model and gives a 1.3-2x speedup on repetitive code. `--lookup-ngram-min` sets the minimum n-gram size (try 2 or 3). This is distinct from standard speculative decoding with `--model-draft`.
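To make the n-gram lookup idea concrete, here is a minimal Python sketch of the drafting step and of greedy acceptance. It is an illustration under assumptions, not llama.cpp's actual implementation: the function names `ngram_draft` and `accept_prefix` and the parameters `ngram_max` and `max_draft` are hypothetical, and only `ngram_min` is meant to mirror the role the report attributes to `--lookup-ngram-min`.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram_min: int = 2,
                ngram_max: int = 4, max_draft: int = 8) -> List[int]:
    """Propose draft tokens by copying what followed an earlier
    occurrence of the current suffix (hypothetical helper).

    Returns an empty list when no n-gram of at least `ngram_min`
    tokens matches, in which case decoding proceeds normally.
    """
    n_tokens = len(tokens)
    # Try longer n-grams first: a longer match is more specific.
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + max_draft])
    return []


def accept_prefix(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Keep drafted tokens up to the first disagreement with the tokens
    the main model would have chosen (greedy verification, simplified)."""
    accepted: List[int] = []
    for drafted, chosen in zip(draft, target):
        if drafted != chosen:
            break
        accepted.append(drafted)
    return accepted


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 1, 2, 3]  # the suffix [1, 2, 3] appeared earlier
    draft = ngram_draft(history)
    print("draft:", draft)
    print("accepted:", accept_prefix(draft, [4, 5, 9]))
```

In this sketch, a larger `ngram_min` produces fewer drafts, but the ones it produces are more likely to be accepted. That is the trade-off behind trying 2 versus 3. Repetitive text such as code yields more suffix matches, which is why the speedup depends on the content.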
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:17:36.420376+00:00— report_created — created