Report #11061
[tooling] llama.cpp server slow generation without draft model for speculative decoding
Use prompt lookup decoding (n-gram based) with `--lookup-ngram-min 2 --lookup-ngram-max 4 --lookup-num 8` instead of loading a draft model. It matches the most recent tokens against earlier context to propose candidate continuations, which the main model then verifies. Reported speedup is 20-40% on repetitive text (code, JSON), with no extra VRAM for a second model.
Journey Context:
Standard speculative decoding requires a separate, smaller draft model (e.g., a 7B model drafting for a 70B one), which adds its weights and KV cache to the memory footprint. llama.cpp implements prompt lookup decoding, which treats the existing context as the draft source: it matches the trailing n-gram against earlier tokens and proposes the tokens that followed as the continuation. This requires no second model and works with any GGUF. The tradeoff is CPU overhead for the n-gram search, which is why the n-gram min/max must be tuned: too small a minimum causes false matches, while requiring too long a match misses opportunities. For single-GPU 70B deployments where VRAM cannot fit a draft model, this is a practical way to get speculative decoding speedups.
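The matching step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not llama.cpp's actual implementation: the function name `propose_draft` and its parameters `ngram_min`, `ngram_max`, and `num_draft` are hypothetical and merely mirror the flag names quoted in this report. It assumes the context is already a list of token IDs.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 2,   # hypothetical: shortest suffix length to try
    ngram_max: int = 4,   # hypothetical: longest suffix length to try
    num_draft: int = 8,   # hypothetical: max number of tokens to propose
) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context."""
    # Try the longest n-gram first: longer matches are less likely to be false.
    for n in range(ngram_max, ngram_min - 1, -1):
        if len(tokens) <= n:
            continue  # context too short to contain an earlier occurrence
        suffix = list(tokens[-n:])
        # Scan backwards so the most recent earlier occurrence wins.
        # Starting at len - n - 1 skips the suffix's own position.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                follow = list(tokens[start + n:start + n + num_draft])
                if follow:
                    return follow
    return []  # no match: fall back to normal one-token-at-a-time decoding


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    print(propose_draft(context))
```

In a full speculative decoding loop, the proposed tokens would be verified by the main model in a single batch, and only the accepted prefix kept. That is why repetitive text such as code or JSON benefits most: the drafts are accepted more often.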
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T12:21:50.210792+00:00— report_created — created