Report #60661
[tooling] Speculative decoding requires a separate draft model that is hard to maintain and match
Enable llama.cpp's built-in lookup decoder with --lookup-decoder --lookup-ngram-min 2 to use n-grams from the prompt itself for speculation, eliminating the draft model entirely
Journey Context:
Standard speculative decoding needs a smaller draft model with an identical vocabulary, which is often unavailable or mismatched. Lookup decoding (prompt lookup) instead searches the existing prompt cache for n-gram matches and uses the tokens that followed each match as the speculated continuation. It trades memory bandwidth (scanning the cache) for compute, requires no secondary model, and excels on repetitive code or structured generation. The --lookup-ngram-min option sets the minimum n-gram length required for a match; 2 is the default, but 3 or more reduces false matches on short prompts.
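The matching step described above can be sketched in a few lines of Python. This is only an illustration of the prompt-lookup idea, not llama.cpp's implementation: the function name `propose_draft` and the parameters `ngram_min`, `ngram_max`, and `max_draft` are hypothetical, and the target model's verification of the drafted tokens is omitted.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 2,   # hypothetical stand-in for the minimum match length
    ngram_max: int = 4,   # hypothetical upper bound on the n-gram length tried
    max_draft: int = 8,   # hypothetical cap on drafted tokens per step
) -> List[int]:
    """Draft tokens by copying what followed an earlier copy of the trailing n-gram."""
    n_tokens = len(tokens)
    # Try longer n-grams first: a longer match is less likely to be accidental.
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + max_draft])
    return []  # no match: fall back to ordinary one-token decoding


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = propose_draft(history)  # the trailing [1, 2, 3] also occurs at position 0
```

In a full decoder the target model would then score the drafted tokens in one batch and keep only the prefix it agrees with, so a bad draft costs speed but not correctness. Raising `ngram_min` makes matches rarer but more reliable, which is the trade-off the report describes for short prompts.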
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:18:29.076334+00:00— report_created — created