Report #606
[tooling] How do I speed up llama-server generation without loading a second draft model?
Start `llama-server` with `--spec-type ngram-simple` (or `ngram-map-k`). No `-md` draft model is required. Tune `--spec-ngram-simple-size-n` (the length of the n-gram used for lookup) and `--spec-ngram-simple-size-m` (the length of the m-gram proposed as a draft) for your workload. This works best on code, repetitive text, and tool-call outputs.
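As a minimal sketch of the launch command, the helper below assembles the argument list from the flags named above. The model path `model.gguf` and the default values for `size_n` and `size_m` are hypothetical placeholders, not recommended settings; check `llama-server --help` on your build to confirm the flag names before running.

```python
import shlex
from typing import List


def build_llama_server_args(
    model_path: str,
    size_n: int = 3,   # hypothetical placeholder value, tune for your workload
    size_m: int = 8,   # hypothetical placeholder value, tune for your workload
) -> List[str]:
    """Build an argv list for llama-server with n-gram speculation and no draft model."""
    return [
        "llama-server",
        "-m", model_path,                        # main model only, no -md draft model
        "--spec-type", "ngram-simple",           # flag as given in this report
        "--spec-ngram-simple-size-n", str(size_n),
        "--spec-ngram-simple-size-m", str(size_m),
    ]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path used only for illustration.
    print(shlex.join(build_llama_server_args("model.gguf")))
```

Passing the resulting list to `subprocess.run` avoids shell quoting issues if the model path contains spaces.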
Journey Context:
Most speculative-decoding guides assume you must load a small draft model, which increases memory use and complicates setup. llama.cpp also supports n-gram speculative decoding: it looks up token sequences already seen in the prompt or context and reuses what followed them as draft continuations. This costs almost no extra memory and can give a 10-50% speedup on structured or repetitive text. The tradeoff is that it helps little on highly creative, open-ended generation, where most drafts are rejected.
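The idea can be illustrated with a small sketch. This is not llama.cpp's implementation; the function names `ngram_draft` and `accept_prefix` are hypothetical, and the sketch assumes the simplest possible strategy: find the most recent earlier occurrence of the last `n` tokens and propose the `m` tokens that followed it.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], n: int, m: int) -> List[int]:
    """Propose up to m draft tokens by matching the last n tokens earlier in the context."""
    if n <= 0 or m <= 0 or len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    # Scan backwards so the most recent earlier match wins; the trailing
    # occurrence of the key itself is excluded from the search.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == key:
            follow = start + n
            return list(tokens[follow:follow + m])
    return []  # no match: nothing to draft, fall back to normal decoding


def accept_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count how many leading draft tokens agree with the model's own choices."""
    accepted = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 1, 2]          # toy token ids; "1 2" was seen before
    draft = ngram_draft(context, n=2, m=2)
    print(draft)
    # If the model would really have produced 3 then 9, only the first draft token is kept.
    print(accept_prefix(draft, [3, 9]))
```

On repetitive text the lookup often finds a match and most draft tokens are accepted, so several tokens are confirmed per model pass. On open-ended text the lookup either finds nothing or the drafts are rejected early, which is why the speedup shrinks.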
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:52:29.911433+00:00— report_created — created