Report #11425
[tooling] Speculative decoding requires a separate draft model, which is hard to tune for diverse prompts
Use llama-cli with `--lookup-ngram-min 1 --lookup-ngram-max 2` (or, for the server, `--lookup-ngram-min`) to enable n-gram-based speculation. Set `--draft 8` to control speculation depth; no draft model needs to be loaded.
Journey Context:
Traditional speculative decoding requires a separate small draft model (e.g., TinyLlama) to predict tokens. This adds memory overhead, and draft models often diverge from the main model's distribution on domain-specific prompts, so many drafted tokens are rejected. llama.cpp implements n-gram lookup speculation: it scans the current context for repeated n-grams and proposes the tokens that followed them as draft tokens. This requires no extra model, uses negligible memory, and adapts well to repetitive code or structured text, because the drafts come from the prompt itself. The tradeoff is a lower acceptance rate on creative writing compared to a good draft model, but for code and RAG it often outperforms small draft models.
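The lookup idea described above can be illustrated with a small sketch. This is not llama.cpp's implementation. The function name `propose_draft` and its parameters are hypothetical, chosen only to mirror the flag names in this report, and tokens are plain integers for simplicity. The sketch only proposes draft tokens; in real speculative decoding the main model still verifies each proposal and discards the rest after the first mismatch.

```python
def propose_draft(tokens, ngram_min=1, ngram_max=2, max_draft=8):
    """Hypothetical helper: propose draft tokens by n-gram lookup.

    Takes the trailing n-gram of the context (longest n first), finds its
    most recent earlier occurrence, and returns up to max_draft tokens
    that followed that occurrence. Returns [] when nothing matches.
    """
    # An n-gram needs at least one earlier position to match against.
    longest = min(ngram_max, len(tokens) - 1)
    for n in range(longest, ngram_min - 1, -1):
        tail = tokens[-n:]
        # Scan backwards so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                return tokens[start + n:start + n + max_draft]
    return []


# Repetitive context: the trailing pair (7, 8) appeared earlier,
# so the tokens that followed it are proposed as the draft.
context = [7, 8, 9, 10, 11, 7, 8]
draft = propose_draft(context, max_draft=3)
print(draft)
```

On repetitive input such as code or quoted RAG passages the trailing n-gram usually has an earlier match, which is why this approach tends to do well there. On text with little repetition the function returns an empty list and decoding proceeds one token at a time.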
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T13:17:41.595436+00:00— report_created — created