Report #99276
[tooling] llama.cpp token generation is too slow for repetitive code or reasoning outputs
Enable `--spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64` in llama-server. No draft model is required; the shared hash pool learns from the current context and even from other slots.
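The flags above can be assembled into a full launch command. This is a minimal sketch, not a verified invocation: the model path is a hypothetical placeholder, `-m` is assumed to be the model option of your llama-server build, and the `--spec-*` flags are copied verbatim from this report, so check them against `llama-server --help` for your version.

```python
import shlex


def ngram_mod_command(model_path):
    """Return the llama-server argv with the ngram-mod settings from this report."""
    return [
        "llama-server",
        "-m", model_path,                    # hypothetical model path, supplied by the caller
        "--spec-type", "ngram-mod",          # draft from n-grams, no draft model
        "--spec-ngram-mod-n-match", "24",    # values taken from the report
        "--spec-ngram-mod-n-min", "48",
        "--spec-ngram-mod-n-max", "64",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(ngram_mod_command("models/your-model.gguf")))
```

The list form can also be passed directly to `subprocess.run` if the server is launched from a script.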
Journey Context:
Speculative decoding usually requires a separate small draft model, which adds setup work. `ngram-mod` instead builds a shared hash pool of about 16 MB from recent n-grams and drafts tokens that repeat earlier sequences. It works best for code editing, summarization, and reasoning models that repeat parts of their context. Dense models can use shorter min/max values; MoE models benefit from longer drafts. Most users know only `--model-draft` and miss this built-in option.
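The idea of drafting from repeated n-grams can be illustrated with a toy sketch. This is not llama.cpp's implementation: the function name `propose_draft` is hypothetical, a plain Python dict stands in for the fixed-size shared hash pool, and the meaning given to `n_match`, `n_min`, and `n_max` (match length, minimum draft length, maximum draft length) is an assumption based on the flag names.

```python
def propose_draft(tokens, n_match, n_min, n_max):
    """Toy n-gram drafter: if the last n_match tokens occurred earlier in the
    context, propose the tokens that followed that earlier occurrence."""
    if len(tokens) <= n_match:
        return []

    # Pool: each n_match-token window maps to the index right after it.
    # Later occurrences overwrite earlier ones. The final window (the current
    # suffix itself) is excluded so that a lookup finds an earlier occurrence.
    pool = {}
    for i in range(len(tokens) - n_match):
        pool[tuple(tokens[i:i + n_match])] = i + n_match

    start = pool.get(tuple(tokens[-n_match:]))
    if start is None:
        return []  # no repeat found, nothing to speculate

    draft = tokens[start:start + n_max]
    # Assumed semantics: drafts shorter than n_min are dropped.
    return draft if len(draft) >= n_min else []


if __name__ == "__main__":
    context = "a b c d e a b c".split()
    # The suffix ("b", "c") appeared earlier, followed by "d e a".
    print(propose_draft(context, n_match=2, n_min=1, n_max=3))
```

In real speculative decoding the target model then verifies the drafted tokens in one batch and keeps only the prefix it agrees with, so a wrong draft costs time but does not change the output.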
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:52:05.651378+00:00— report_created — created