Report #98334
[tooling] Local LLM generation is too slow, but loading a separate draft model for speculative decoding is cumbersome
Use llama-server's built-in n-gram speculative decoder with no extra model: --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64. The hash pool is shared across all server slots, so parallel requests benefit from each other's patterns.
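A minimal launch sketch, assuming you start the server from a Python script. The speculative-decoding flags and values are copied verbatim from this report; the model path, the helper name build_server_command, and the extra_args parameter are hypothetical. Flag names may differ between llama.cpp builds, so check `llama-server --help` before relying on them.

```python
import shlex

# Flags and values as given in this report; verify them against
# `llama-server --help` for your build before use.
NGRAM_MOD_FLAGS = [
    "--spec-type", "ngram-mod",
    "--spec-ngram-mod-n-match", "24",
    "--spec-ngram-mod-n-min", "48",
    "--spec-ngram-mod-n-max", "64",
]


def build_server_command(model_path, binary="llama-server", extra_args=()):
    """Return the argv list for llama-server with ngram-mod speculation enabled.

    build_server_command is a hypothetical helper, not part of llama.cpp.
    """
    return [binary, "-m", model_path, *NGRAM_MOD_FLAGS, *extra_args]


if __name__ == "__main__":
    # "/path/to/model.gguf" is a placeholder; substitute your own model file.
    cmd = build_server_command("/path/to/model.gguf")
    print(shlex.join(cmd))
    # To actually launch the server:
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to inspect the flags before launching anything.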
Journey Context:
Draft-model speculative decoding gives the biggest speedups, but it forces you to find a compatible smaller model with the same tokenizer, manage its memory, and tune --draft-max/--draft-min. ngram-mod avoids all of that: it builds a rolling hash of recent n-grams and speculates the next token from a shared pool. It shines whenever the output repeats patterns (code refactoring, summarization, reasoning chains, llama.vim fill-in-the-middle). The tradeoff is that it helps dense models and repetitive text far more than open-ended chat. MoE models need longer drafts, so keep n-min/n-max high; for dense models you can lower them.
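To make the idea concrete, here is a toy sketch of n-gram lookup drafting. It is not llama.cpp's implementation: the class NgramPool, its methods, and the small parameter values are all hypothetical, a Python dict stands in for the fixed-size hash pool, and treating n_min as "discard drafts shorter than this" is an assumption about what the flag means.

```python
class NgramPool:
    """Toy pool mapping each n-gram of tokens to the token that followed it."""

    def __init__(self, n_match=3):
        self.n_match = n_match  # length of the n-gram used as lookup key
        self.table = {}         # n-gram tuple -> next token (latest wins)

    def observe(self, tokens):
        """Record every (n-gram -> next token) pair seen in a token sequence."""
        n = self.n_match
        for i in range(len(tokens) - n):
            self.table[tuple(tokens[i:i + n])] = tokens[i + n]

    def draft(self, context, n_min=1, n_max=8):
        """Propose up to n_max tokens by repeatedly looking up the last n tokens.

        Returns an empty list when fewer than n_min tokens can be drafted.
        """
        n = self.n_match
        if len(context) < n:
            return []
        window = list(context[-n:])
        out = []
        while len(out) < n_max:
            nxt = self.table.get(tuple(window))
            if nxt is None:
                break
            out.append(nxt)
            window = window[1:] + [nxt]  # slide the window over the draft
        return out if len(out) >= n_min else []


# One pool shared by two hypothetical server slots: slot A has produced
# this sequence, and slot B is now generating similar text.
pool = NgramPool(n_match=3)
pool.observe([1, 2, 3, 4, 5, 1, 2, 3])
print(pool.draft([7, 1, 2, 3], n_max=4))  # [4, 5, 1, 2]
print(pool.draft([9, 9, 9]))              # [] because the n-gram is unseen
```

The drafted tokens are only proposals; the main model still verifies them, which is why repetitive output (where lookups hit often) benefits most.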
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:47:59.864267+00:00— report_created — created