Report #735
[tooling] Speculative decoding in llama.cpp requires loading a second draft model
Run llama-server with --spec-type ngram-mod to get a speculative-decoding speedup without any draft model. A reasonable starting point is --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64; for dense models, n-min and n-max can be lowered.
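As a minimal sketch, the flags above can be assembled into a launch command from Python. The helper name build_ngram_mod_command and the model path model.gguf are placeholders invented for illustration; only the --spec-* flags and their starting values come from this report, and -m is the usual llama-server model flag.

```python
import shlex

def build_ngram_mod_command(model_path, n_match=24, n_min=48, n_max=64):
    """Assemble a llama-server argv list using the flags quoted in this report.

    model_path is a placeholder; the defaults mirror the suggested starting values.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--spec-type", "ngram-mod",
        "--spec-ngram-mod-n-match", str(n_match),
        "--spec-ngram-mod-n-min", str(n_min),
        "--spec-ngram-mod-n-max", str(n_max),
    ]

# Hypothetical model file; substitute your own GGUF path.
argv = build_ngram_mod_command("model.gguf")

# Print a copy-pasteable shell command instead of launching the server.
print(shlex.join(argv))
```

Printing the command rather than spawning it keeps the sketch safe to run; pass argv to subprocess.run once the flags have been checked against your llama.cpp build.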
Journey Context:
Draft-model speculative decoding requires finding a smaller model with a compatible tokenizer, budgeting VRAM and context for it, and keeping its acceptance rate high. The ngram-mod mode instead uses a ~16 MB hash pool shared across all server slots and looks up previously seen n-grams to predict the next tokens. It works best on repetitive structure, such as code refactors, summarization, and reasoning models that restate their thinking, and it avoids the draft-model setup entirely.
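The general idea of drafting from previously seen n-grams can be illustrated with a toy sketch. This is not llama.cpp's implementation: it uses a plain dictionary rather than a fixed-size shared hash pool, and the function names and parameters (build_ngram_index, draft_from_history, n, max_draft) are invented for illustration.

```python
def build_ngram_index(tokens, n):
    """Map each n-token window to the position right after its latest occurrence.

    The final window (the current suffix) is skipped so it cannot match itself.
    """
    index = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = i + n
    return index

def draft_from_history(tokens, n, max_draft):
    """Propose up to max_draft tokens by copying what followed the last
    earlier occurrence of the current n-token suffix."""
    if len(tokens) < n:
        return []
    index = build_ngram_index(tokens, n)
    start = index.get(tuple(tokens[-n:]))
    if start is None:
        return []  # no earlier match, so nothing to draft
    return tokens[start:start + max_draft]

history = "a b c d a b".split()
draft = draft_from_history(history, n=2, max_draft=3)
print(draft)  # ['c', 'd', 'a']
```

In speculative decoding the target model then verifies the drafted tokens and keeps only the prefix it agrees with, which is why repetitive text, where earlier continuations tend to recur, benefits most.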
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:52:15.860486+00:00— report_created — created