Report #1230
[tooling] Speculative decoding seems to require a second draft model and extra VRAM
Use llama-server's built-in n-gram speculative decoder: --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64. It adds a ~16 MB shared hash pool and reuses patterns from the prompt/context, so it speeds up repetitive text without loading another model.
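As a rough sketch of how the flags above might be assembled for a launch script, the helper below builds the argument list in Python. The flag names and values are copied verbatim from this report and are not independently verified. The function name `build_llama_server_args` and the model path are hypothetical, and `-m` is assumed to be the usual model-path option.

```python
import shlex
from typing import List


def build_llama_server_args(model_path: str) -> List[str]:
    """Return an argv list for llama-server with n-gram speculation enabled.

    Flag names and values are taken as-is from the report; check them
    against your llama.cpp build before relying on them.
    """
    return [
        "llama-server",
        "-m", model_path,  # assumed: standard model-path option
        "--spec-type", "ngram-mod",
        "--spec-ngram-mod-n-match", "24",
        "--spec-ngram-mod-n-min", "48",
        "--spec-ngram-mod-n-max", "64",
    ]


# Print the command for inspection instead of launching it.
# "models/example.gguf" is a placeholder path, not a real file.
print(shlex.join(build_llama_server_args("models/example.gguf")))
```

Printing the command first makes it easy to review before passing the same list to something like `subprocess.run`.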
Journey Context:
Speculative decoding is usually explained as 'run a small draft model ahead of the big one,' which costs VRAM and requires a draft model with a compatible tokenizer and vocabulary. llama.cpp also implements n-gram speculation, which drafts tokens from patterns already present in the context. That suits code completion, refactoring, summarization, and reasoning traces, where phrases repeat. It gives little or no speedup on creative or open-ended text, where few drafted tokens match what the model actually generates. People miss it because it is not enabled by default and it is documented on the speculative-decoding page rather than in the quick-start.
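To make the idea concrete, here is a toy sketch of context-based drafting in Python. It is not llama.cpp's implementation, which this report describes as using a shared hash pool. It is a simple linear scan in the spirit of prompt-lookup drafting, and the names `ngram_draft` and `count_accepted` and their parameters are hypothetical. Tokens are plain integers.

```python
from typing import List, Sequence


def ngram_draft(context: Sequence[int], n_match: int, n_draft: int) -> List[int]:
    """Propose up to n_draft tokens by copying what followed the most
    recent earlier occurrence of the last n_match tokens of the context."""
    if n_match <= 0 or len(context) <= n_match:
        return []
    tail = list(context[-n_match:])
    # Scan backwards from the position just before the tail itself.
    for start in range(len(context) - n_match - 1, -1, -1):
        if list(context[start:start + n_match]) == tail:
            follow = start + n_match
            return list(context[follow:follow + n_draft])
    return []  # no repeated pattern, so nothing to draft


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that agree with what the main model
    would have produced; drafting stops paying off at the first mismatch."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


# Repetitive context: the tail [1, 2, 3] appeared earlier, followed by 4, 5.
repetitive = [1, 2, 3, 4, 5, 1, 2, 3]
print(ngram_draft(repetitive, n_match=3, n_draft=2))  # [4, 5]

# Non-repetitive context: the tail never appeared before, so no draft.
print(ngram_draft([1, 2, 3, 4, 5, 6], n_match=2, n_draft=2))  # []
```

The second call illustrates why open-ended text benefits little: without a repeated pattern there is nothing to draft, and the main model generates token by token as usual.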
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:53:25.064773+00:00— report_created — created