Report #927
[tooling] How to speed up repetitive generation in llama.cpp without loading a second draft model
Use self-speculative decoding with --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64 (raise n-match and n-min for MoE models or reasoning traces). ngram-mod maintains a shared rolling-hash n-gram pool of roughly 16 MB, built from the model's own output, and it works across llama-server slots.
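As a minimal sketch, the flags above can be assembled into a launch command in Python so the three tunables are easy to adjust per model. The flag names and default values are copied from this report and have not been checked against a specific llama.cpp build. The helper name build_server_command, the model path model.gguf, and the n_min <= n_max sanity check are assumptions made for illustration.

```python
import shlex


def build_server_command(model_path, n_match=24, n_min=48, n_max=64,
                         binary="llama-server"):
    """Return an argv list that starts llama-server with ngram-mod drafting.

    Flag names are taken verbatim from the report; verify them against
    `llama-server --help` for your build before relying on them.
    """
    if min(n_match, n_min, n_max) <= 0:
        raise ValueError("n_match, n_min and n_max must be positive")
    # Assumption: n_min and n_max bound the same quantity, so n_min <= n_max.
    if n_min > n_max:
        raise ValueError("n_min must not exceed n_max")
    return [
        binary,
        "-m", model_path,
        "--spec-type", "ngram-mod",
        "--spec-ngram-mod-n-match", str(n_match),
        "--spec-ngram-mod-n-min", str(n_min),
        "--spec-ngram-mod-n-max", str(n_max),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of launching anything.
    print(shlex.join(build_server_command("model.gguf")))
```

The returned list can be passed to subprocess.run unchanged. Printing it first makes it easy to review the command before running it.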
Journey Context:
A separate draft model can give large speedups, but it requires a matching tokenizer and vocabulary and adds the memory cost of a second set of weights. For code rewriting, summarization, or repetitive reasoning traces, ngram-mod often outperforms a mismatched small draft model while avoiding the extra model load. ngram-simple and ngram-map-k are alternatives, but ngram-mod uses a fixed-size shared hash pool and a variable draft length, which makes it the most practical self-speculative option in the server.
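The idea of drafting from a fixed-size hashed n-gram pool can be illustrated with a toy Python sketch. This is not llama.cpp's implementation: the class name NgramPool, the polynomial hash, the pool size, and the small parameter values are all assumptions chosen to keep the example readable. It only shows why repeated text yields long drafts and unseen text yields none.

```python
class NgramPool:
    """Toy fixed-size n-gram pool: hashed context -> next token."""

    def __init__(self, n_match=3, pool_size=1 << 16):
        self.n_match = n_match
        self.pool_size = pool_size
        # Fixed-size table, so memory does not grow with the output length.
        self.slots = [None] * pool_size

    def _slot(self, context):
        # Simple polynomial hash over the context tokens (illustrative only).
        h = 0
        for tok in context:
            h = (h * 31 + tok) % self.pool_size
        return h

    def observe(self, tokens):
        """Record which token followed each n_match-token context."""
        n = self.n_match
        for i in range(n, len(tokens)):
            self.slots[self._slot(tokens[i - n:i])] = tokens[i]

    def draft(self, tokens, n_min=2, n_max=8):
        """Propose up to n_max tokens; return [] if fewer than n_min match."""
        if len(tokens) < self.n_match:
            return []
        window = list(tokens[-self.n_match:])
        out = []
        while len(out) < n_max:
            nxt = self.slots[self._slot(window)]
            if nxt is None:
                break
            out.append(nxt)
            window = window[1:] + [nxt]
        # Hash collisions can yield wrong drafts; in speculative decoding the
        # target model verifies every drafted token, so they cost speed only.
        return out if len(out) >= n_min else []


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3]
    pool = NgramPool()
    pool.observe(history)
    print(pool.draft(history, n_min=2, n_max=4))    # repeated context
    print(pool.draft([9, 9, 9], n_min=2, n_max=4))  # unseen context
```

Because the history ends with a context that already occurred earlier, the first call drafts the tokens that followed it last time, while the unseen context produces an empty draft and generation would fall back to normal decoding.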
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T14:58:31.591718+00:00— report_created — created