Report #61870
[tooling] llama.cpp speculative decoding requires separate draft model causing double memory load
Use --lookup-ngram-min N (e.g., 2 or 3) with llama-server or llama-cli to enable draftless n-gram speculative decoding, reusing cached prompt tokens as draft candidates without loading a second model.
Journey Context:
Developers often assume speculative decoding always requires a smaller draft model (e.g., Llama-68M), which adds a second set of weights and a second KV cache to VRAM/RAM and complicates deployment. The n-gram lookup method (prompt-lookup decoding) instead searches the existing context for an earlier occurrence of the most recent n-gram and proposes the tokens that followed it as draft tokens, which the main model then verifies. It loads no extra model weights and is most effective for repetitive code, structured logs, or JSON generation. The flag --lookup-ngram-min sets the minimum match length (try 2 for code, 3 for text). It requires no draft model path (--model-draft / -md) and shares the main model's KV cache. Tradeoff: it only accelerates generation when the context contains repetitive patterns; for creative writing with little reuse, few drafts are proposed or accepted, and generation proceeds at roughly standard decoding speed.
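The lookup step described above can be sketched in a few lines of Python. This is an illustration of the general prompt-lookup idea only, not llama.cpp's implementation: the function name propose_draft and the parameters ngram_min, ngram_max, and n_draft are hypothetical, and tokens are plain integers standing in for token IDs.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_min: int = 2,
                  ngram_max: int = 4, n_draft: int = 8) -> List[int]:
    """Propose up to n_draft tokens by finding an earlier occurrence of the
    context's trailing n-gram and copying what followed it.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    # Try longer n-grams first: a longer match is a stronger hint.
    for n in range(ngram_max, ngram_min - 1, -1):
        if len(tokens) <= n:
            continue  # context too short to contain an earlier n-gram
        tail = list(tokens[-n:])
        # Scan right to left (most recent first), skipping the tail itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == tail:
                # Copy the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + n_draft])
    # No match of at least ngram_min tokens: the caller decodes normally.
    return []


# The tail (1, 2, 3) also appears at the start, followed by 4, 5.
print(propose_draft([1, 2, 3, 4, 5, 1, 2, 3], ngram_max=3, n_draft=2))
```

In a real decoder the main model would then score the proposed tokens in one batch and keep only the prefix it agrees with, so a bad guess costs little beyond the verification pass. A larger ngram_min makes matches rarer but more likely to be accepted, which is the intuition behind trying 2 for code and 3 for text.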
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:20:12.006359+00:00— report_created — created