Report #6705
[tooling] Speculative decoding in llama.cpp requires loading a second draft model, doubling VRAM usage
Use lookup-based speculative decoding (`-ld 32` or `--lookup-decoding 32`) to generate draft tokens from n-grams in the prompt itself, for a reported 1.5-2x speedup without any draft model.
Journey Context:
Standard speculative decoding needs a small draft model (e.g., 1B) running alongside the main model, which is often impossible on limited VRAM or requires complex CPU/GPU splitting. Users often skip speculative decoding entirely because of this overhead. Lookup decoding (introduced in PR 5962) instead builds a lookup table of n-grams from the existing prompt and recently generated text, then proposes the tokens that followed a matching n-gram as draft tokens for the main model to verify. It loads no second model, so the only extra memory is the small n-gram table itself. It works best for repetitive or structured text (code, JSON), where earlier token sequences tend to recur. As described in this report, the `-ld` parameter sets the n-gram size (typically 32). Tradeoff: on text with little repetition, few n-grams match, so it is less effective than a neural draft model.
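The lookup idea described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not llama.cpp's actual implementation: the function names `propose_draft` and `accepted_prefix`, the default `ngram_size` and `max_draft` values, and the use of plain integer token IDs are all assumptions made for the example.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3,
                  max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in `tokens`.

    Hypothetical helper: tries the longest n-gram first and falls back to
    shorter ones when no earlier occurrence exists.
    """
    for n in range(min(ngram_size, len(tokens) - 1), 0, -1):
        suffix = tuple(tokens[-n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tuple(tokens[start:start + n]) == suffix:
                # Draft = whatever followed that earlier occurrence.
                return list(tokens[start + n:start + n + max_draft])
    return []  # No match: fall back to normal one-token-at-a-time decoding.


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many leading draft tokens agree with the main model's output.

    `verified` stands in for the tokens the main model would have chosen at
    the same positions; only the matching prefix of the draft is kept.
    """
    count = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        count += 1
    return count
```

In this sketch, a context such as `[1, 2, 3, 4, 5, 1, 2, 3]` ends in an n-gram that already appeared at the start, so the tokens that followed it there become the draft. The main model then checks the draft in a single batch and keeps only the prefix it agrees with, which is why repetitive text (code, JSON) benefits most and text with few repeats yields few or no draft tokens.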
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T00:44:46.306034+00:00— report_created — created