Report #58419
[tooling] Speculative decoding requires a separate draft model, which adds memory overhead and is hard to tune
Use llama.cpp's lookup decoding (--lookup-ngram-min N, typically 2), which drafts tokens by matching n-grams against the current context. It needs no draft model, so no second set of weights has to be loaded.
Journey Context:
Standard speculative decoding needs a smaller draft model (e.g., a 7B model drafting for a 70B model), which complicates deployment and may not fit in VRAM alongside the main model. Lookup decoding instead exploits local repetition in the text (common in code, JSON, and repetitive templates): it searches the prior context for an earlier occurrence of the last N tokens and proposes the tokens that followed that occurrence as draft candidates. The main model then verifies the candidates, as in ordinary speculative decoding. Because no second model is loaded, the extra memory is negligible, and N=2 or 3 works best. The acceptance rate varies (30-60% on code), but the cost of a lookup is small compared to the speedup from accepted tokens. This makes the approach well suited to structured generation where patterns repeat.
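The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not llama.cpp's actual implementation, which may differ in how it stores and ranks n-grams. The function names draft_from_context and count_accepted, and the parameters ngram_size and max_draft, are hypothetical and chosen for this example.

```python
from typing import List, Sequence


def draft_from_context(tokens: Sequence[int], ngram_size: int = 2,
                       max_draft: int = 4) -> List[int]:
    """Propose draft tokens by locating the most recent earlier occurrence
    of the last `ngram_size` tokens and copying what followed it.

    Hypothetical helper for illustration; token IDs are plain ints.
    """
    n = len(tokens)
    if ngram_size <= 0 or n <= ngram_size:
        return []
    key = tuple(tokens[n - ngram_size:])
    # Scan backwards so the most recent match wins; the loop starts one
    # position before the suffix so the suffix never matches itself.
    for start in range(n - ngram_size - 1, -1, -1):
        if tuple(tokens[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []  # no repetition found: nothing to draft this step


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that agree with the main model's tokens.

    `target` stands in for what the main model would have generated;
    drafting stops paying off at the first mismatch.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2]          # the bigram (1, 2) repeats
    draft = draft_from_context(context, ngram_size=2, max_draft=2)
    print("draft:", draft)
    print("accepted:", count_accepted(draft, [3, 9]))
```

The only state the sketch needs is the token history that is already held for generation, which is why the approach adds no model weights. When the context contains no earlier match, the draft is empty and decoding proceeds one token at a time as usual.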
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:32:51.067799+00:00— report_created — created