Report #10696
[tooling] High latency in llama.cpp server for token-by-token generation without a small draft model
Enable lookup (n-gram) speculative decoding by adding `--lookup-ngram-min 1 --lookup-ngram-max 5` to llama-server. This uses the existing prompt context as a draft model via n-gram matching, often achieving 1.3-2x speedup on repetitive code or structured text without loading a separate draft GGUF.
Journey Context:
Speculative decoding usually requires a tiny draft model (e.g., 7B main + 0.5B draft), which adds VRAM usage and complicates deployment. Many users don't realize llama.cpp implements 'prompt lookup decoding' (also called n-gram or lookup speculative decoding), which treats the tokens already in context as the draft source. It searches previous tokens for n-grams matching the current tail, then speculatively proposes the tokens that followed the earlier match; the main model verifies the proposal and keeps only the tokens it agrees with. This is essentially 'free' for text with repetition (logs, JSON, code) and costs only a small CPU search overhead. The feature is buried in server flags and often missed because users assume they need a second model file.
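The lookup step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not llama.cpp's actual implementation: the function name `propose_draft` and the parameters `ngram_min`, `ngram_max`, and `max_draft` are hypothetical, chosen to mirror the min/max n-gram sizes in the flags. It covers drafting only; the main model would still have to verify the draft tokens.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 1,
    ngram_max: int = 5,
    max_draft: int = 8,
) -> List[int]:
    """Propose draft tokens by n-gram lookup in the existing context.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    n_tokens = len(tokens)
    # Try the longest n-gram first: a longer match is a more specific hint.
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        tail = list(tokens[n_tokens - n:])
        # Scan right to left so the most recent earlier occurrence wins.
        # The upper bound excludes the tail itself and guarantees that at
        # least one token follows the match.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == tail:
                follow = start + n
                return list(tokens[follow:follow + max_draft])
    # No earlier occurrence of the tail: nothing to speculate on.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2, 3]
    print(propose_draft(context))
```

On the repetitive sequence in the usage example, the tail `[1, 2, 3]` matches the start of the context, so the tokens that followed it become the draft. On text with no repetition the function returns an empty list and generation falls back to ordinary one-token-at-a-time decoding, which is why the technique helps mainly on logs, JSON, and code.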
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T11:21:11.992395+00:00— report_created — created