Report #99755
[tooling] llama-server token generation is slow and loading a draft model costs too much VRAM
Enable self-speculative decoding with --spec-type ngram-mod. No extra model is loaded; drafting uses a hash pool of about 16 MB that is shared across slots. Start with --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64. For dense models you can reduce n-min/n-max; MoE models benefit from longer drafts.
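The starting values above can be assembled into a launch command. This is a minimal sketch, not a verified invocation: the flag names are copied from this report and may differ between llama.cpp builds (check llama-server --help), and the model path and the helper name build_llama_server_cmd are hypothetical placeholders.

```python
import shlex
from typing import List


def build_llama_server_cmd(
    model_path: str,
    n_match: int = 24,
    n_min: int = 48,
    n_max: int = 64,
) -> List[str]:
    """Assemble a llama-server argv list with ngram-mod speculation enabled.

    Flag names are taken from the report text, not verified against a
    specific llama.cpp build.
    """
    if n_min > n_max:
        raise ValueError("n_min must not exceed n_max")
    return [
        "llama-server",
        "-m", model_path,  # hypothetical path supplied by the caller
        "--spec-type", "ngram-mod",
        "--spec-ngram-mod-n-match", str(n_match),
        "--spec-ngram-mod-n-min", str(n_min),
        "--spec-ngram-mod-n-max", str(n_max),
    ]


# Print a shell-quoted command line for copy and paste.
cmd = build_llama_server_cmd("/path/to/model.gguf")
print(shlex.join(cmd))
```

For a dense model, pass smaller n_min and n_max values and compare tokens per second before settling on them.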
Journey Context:
Draft-model speculative decoding requires a second model, a compatible tokenizer, and extra VRAM, which is often more trouble than the speedup is worth. ngram-mod instead hashes recent n-grams in the current context and uses them to draft the next tokens, so it shines on repetitive text such as code refactoring, summarization, and reasoning models that echo their own thinking. It is the easiest speculative mode to turn on because there is nothing to convert or download.
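The idea behind n-gram drafting can be illustrated with a small lookup table. This is a simplified sketch of the general technique, not llama.cpp's implementation: it uses a plain dictionary rather than a fixed-size shared hash pool, and the function names and parameters are hypothetical. The drafted tokens are only proposals; the main model still verifies them and keeps the prefix that matches its own predictions.

```python
from typing import Dict, List, Tuple


def build_ngram_table(tokens: List[int], n: int) -> Dict[Tuple[int, ...], int]:
    """Map each n-gram to the index just after its most recent occurrence.

    The final n-gram (the current tail of the context) is skipped because
    nothing follows it yet.
    """
    table: Dict[Tuple[int, ...], int] = {}
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])] = i + n
    return table


def draft_tokens(tokens: List[int], n: int, max_draft: int) -> List[int]:
    """Propose up to max_draft tokens by replaying what followed the same
    n-gram earlier in the context. Returns [] when there is no match."""
    if len(tokens) < n:
        return []
    table = build_ngram_table(tokens, n)
    pos = table.get(tuple(tokens[-n:]))
    if pos is None:
        return []
    return tokens[pos:pos + max_draft]


# The context ends with (1, 2, 3), which also appeared at the start,
# so the draft replays the tokens that followed it there.
context = [1, 2, 3, 4, 5, 1, 2, 3]
print(draft_tokens(context, n=3, max_draft=4))  # [4, 5, 1, 2]
```

On text with little repetition the lookup rarely matches and no draft is produced, which is why the speedup depends on the workload.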
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:00:10.090549+00:00— report_created — created