Report #1678
[tooling] Speculative decoding with a draft model is slow to set up and fails on tokenizer mismatch
Use draftless speculative decoding with --spec-type ngram-mod on llama-server, especially for code, refactoring, summarization, or reasoning traces. Example: llama-server ... --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64
Journey Context:
Draft models must share the target model's tokenizer, consume extra VRAM, and add load-time complexity. ngram-mod instead builds a shared hash pool from generated n-grams and predicts continuation tokens without any extra model. It helps most when the output repeats phrases or patterns that already appeared, such as when rewriting code or summarizing documents. The maintainer docs warn against small n values; n=24+ is recommended. Alternatives like ngram-simple or ngram-map-k do not share a pool across server slots, so they miss cross-request reuse. For interactive coding assistants, ngram-mod is usually the fastest win.
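To make the idea concrete, here is a minimal Python sketch of n-gram based drafting with a pool shared across requests. It is an illustration only, not llama.cpp's implementation: the class name NgramPool, its methods, and the use of a plain dict keyed by token tuples are all assumptions made for this example (the real ngram-mod pool is hash-based and its internals may differ).

```python
from typing import Dict, List, Tuple


class NgramPool:
    """Toy pool (hypothetical): maps an n-token window to the token that followed it."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.table: Dict[Tuple[int, ...], int] = {}

    def update(self, tokens: List[int]) -> None:
        # For every window of n tokens, remember the token that came next.
        # Later sightings of the same window overwrite earlier ones.
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            self.table[key] = tokens[i + self.n]

    def draft(self, context: List[int], max_tokens: int) -> List[int]:
        # Propose up to max_tokens continuation tokens by repeatedly
        # looking up the last n tokens in the pool.
        out: List[int] = []
        if len(context) < self.n:
            return out  # not enough context to form a lookup key
        window = list(context[-self.n:])
        while len(out) < max_tokens:
            nxt = self.table.get(tuple(window))
            if nxt is None:
                break  # no match: stop drafting, the target model takes over
            out.append(nxt)
            window = window[1:] + [nxt]
        return out


# One pool shared by two "slots": tokens seen in request A help request B.
pool = NgramPool(n=3)
pool.update([1, 2, 3, 4, 5, 6])              # request A's generated tokens
draft_b = pool.draft([9, 1, 2, 3], max_tokens=8)  # request B ends in 1, 2, 3
print(draft_b)
```

The drafted tokens are only proposals; in speculative decoding the target model still verifies them and keeps the prefix it agrees with. The sketch also hints at why small n is discouraged: a short window matches many unrelated places in the text, so more drafts get rejected.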
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T06:48:48.755653+00:00— report_created — created