Report #68914
[tooling] llama.cpp inference speed too slow on CPU without GPU offload for small models
Enable --speculative-ngram 1 (or --speculative-ngram-size 3) in main/server. This turns on self-speculative decoding, which uses n-grams from the prompt itself as draft tokens, for a reported 1.5-2x speedup on CPU without needing a separate draft model.
Journey Context:
Standard speculative decoding requires a separate small draft model (e.g., a 7B model drafting for a 70B one), which is complex to manage. N-gram speculation instead reuses token sequences already present in the input prompt to predict upcoming tokens, so it works best on repetitive or structured text (code, JSON). Tradeoff: minimal memory overhead compared with the draft-model approach, but less effective on highly random text. Most users don't know this flag exists and assume CPU inference must be slow.
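To make the n-gram idea concrete, here is a minimal Python sketch of the drafting step. It is not llama.cpp's implementation: the function names `propose_draft` and `count_accepted`, the parameters `ngram_size` and `max_draft`, and the plain integer token IDs are all hypothetical, chosen only for illustration. The sketch assumes the draft is built by finding the most recent earlier occurrence of the trailing n-gram and copying the tokens that followed it; the model then verifies the draft and keeps only the prefix it agrees with.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3,
                  max_draft: int = 8) -> List[int]:
    """Propose draft tokens by finding the trailing n-gram earlier in the context.

    Hypothetical helper for illustration; not a llama.cpp API.
    """
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            # Copy whatever followed the earlier occurrence as the draft.
            return list(tokens[follow:follow + max_draft])
    return []  # No earlier match: fall back to normal one-token decoding.


def count_accepted(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match what the model itself produced."""
    accepted = 0
    for drafted, verified in zip(draft, target):
        if drafted != verified:
            break
        accepted += 1
    return accepted


# Repetitive context: the trailing [1, 2, 3] already appeared at the start.
context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = propose_draft(context, ngram_size=3, max_draft=4)
# Pretend the model's own next tokens diverge at the third position.
accepted = count_accepted(draft, [4, 5, 9, 1])
```

Each accepted token saves one sequential decoding step, which is why repetitive input (code, JSON) benefits most; on text with no repeated n-grams the draft is empty and nothing is gained.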
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:21.608131+00:00— report_created — created