Report #13657
[tooling] Slow token generation (sub-20 t/s) on local LLM despite low GPU utilization
Enable prompt lookup decoding (n-gram speculative decoding) in llama.cpp with --lookup-ngram-min 1 --lookup-num-ntokens 10 --lookup-ngram-max 5. It needs no draft model and is reported to give a 2-3x speedup on structured or repetitive data. Flag names are as reported; confirm them against --help for your build.
Journey Context:
Standard speculative decoding requires loading a separate, smaller draft model (e.g., a 7B draft for a 70B main model), which increases VRAM usage and complicates deployment; many users lack the VRAM for two models. llama.cpp implements prompt lookup decoding, which needs no additional model weights: it takes the last few tokens of the context as an n-gram, searches the earlier prompt/cache for the same n-gram, and proposes the tokens that followed that earlier occurrence as the draft, which the main model then verifies. Tradeoff: it only helps when the context contains repetitive patterns (code, JSON, structured text) and is less effective for creative writing. Common error: loading a draft model (--draft-model as reported; llama.cpp builds generally spell this -md / --model-draft) without sufficient VRAM, causing crashes. The n-gram approach is underused because it was added later and lacks visibility; on structured data it can provide large speedups without the memory overhead of a second model.
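The lookup step described above can be illustrated with a minimal Python sketch. This is not llama.cpp's implementation: the function name propose_draft and its parameters (ngram_min, ngram_max, num_draft) are hypothetical and only loosely mirror the flags in this report. It shows how a draft is proposed from the context alone; in a real decoder the main model would then verify the draft in one batched pass and keep only the accepted prefix.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 1,   # hypothetical: shortest trailing n-gram to try
    ngram_max: int = 5,   # hypothetical: longest trailing n-gram to try
    num_draft: int = 10,  # hypothetical: maximum number of draft tokens
) -> List[int]:
    """Propose draft tokens by finding the context's trailing n-gram earlier in the context."""
    # Try the longest n-gram first: a longer match is a more specific predictor.
    longest = min(ngram_max, len(tokens) - 1)
    for n in range(longest, max(ngram_min, 1) - 1, -1):
        suffix = list(tokens[-n:])
        # Scan backwards so the most recent earlier occurrence wins.
        # The upper bound excludes the suffix itself and guarantees
        # at least one token follows the match.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + num_draft])
    # No earlier occurrence: fall back to ordinary one-token decoding.
    return []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2, 3]
    print(propose_draft(context))
```

With the context above, the trailing n-gram [1, 2, 3] also occurs at the start of the context, so the tokens that followed it there become the draft. On text with no repeated n-grams the function returns an empty list, which matches the tradeoff noted in the report: no repetition, no speedup.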
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T19:19:38.981430+00:00— report_created — created