Report #1032
[tooling] Speculative decoding in llama.cpp seems to require a separate draft model and extra VRAM setup
Use --spec-type ngram-mod in llama-server to draft tokens from the existing context without loading a second model, for example: --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64. It helps most on repetitive code, reasoning models that repeat themselves, and summarization.
Journey Context:
llama.cpp has several speculative strategies that need no extra model. ngram-mod builds a shared hash pool from n-grams seen in the context and uses it to propose the next token; it is lightweight (~16 MB), works across slots, and has no draft-model compatibility requirements. It is not a universal win: dense general chat may see little benefit, while MoE models need longer drafts. It can be combined with draft-model methods, and the n-match, n-min, and n-max parameters trade draft length against acceptance rate.
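To make the idea concrete, here is a minimal Python sketch of n-gram drafting from a shared pool. It is an illustration only, not llama.cpp's implementation: the function names build_pool and propose_draft are hypothetical, a plain dict stands in for the fixed-size hash pool, and the reading of the parameters (n_match as the lookup window length, n_min and n_max as the bounds on draft length) is an assumption based on the description above.

```python
from typing import Dict, List, Optional, Tuple

Pool = Dict[Tuple[int, ...], int]


def build_pool(tokens: List[int], n_match: int, pool: Optional[Pool] = None) -> Pool:
    """Record, for every n_match-token window in `tokens`, the token that followed it.

    Passing an existing `pool` lets several contexts share one table
    (assumed analogue of sharing across slots). Later occurrences overwrite earlier ones.
    """
    if pool is None:
        pool = {}
    for i in range(len(tokens) - n_match):
        pool[tuple(tokens[i:i + n_match])] = tokens[i + n_match]
    return pool


def propose_draft(pool: Pool, tokens: List[int], n_match: int,
                  n_min: int, n_max: int) -> List[int]:
    """Extend `tokens` one predicted token at a time by looking up the trailing window.

    Stops at the first unknown window or after n_max tokens. Drafts shorter
    than n_min are discarded (assumed meaning of n_min).
    """
    if len(tokens) < n_match:
        return []
    window = list(tokens[-n_match:])
    draft: List[int] = []
    while len(draft) < n_max:
        nxt = pool.get(tuple(window))
        if nxt is None:
            break
        draft.append(nxt)
        window = window[1:] + [nxt]  # slide the window over the drafted token
    return draft if len(draft) >= n_min else []


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 1, 2, 3]
    pool = build_pool(context, n_match=3)
    print(propose_draft(pool, context, n_match=3, n_min=2, n_max=4))
```

In this sketch, a longer n_match makes lookups stricter (fewer but more reliable hits), while n_min and n_max bound how many tokens are handed to the target model for verification in one step. Repetitive input gives long chains of hits, which is consistent with the report's note that repetitive code and summarization benefit most.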
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T16:54:42.181144+00:00— report_created — created