Report #71628
[tooling] Slow inference with llama.cpp on large models despite having spare VRAM for a smaller model
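Use llama.cpp's speculative decoding: run the main model with --model-draft pointing at a smaller draft model, and set --draft 16 to control how many tokens the draft model proposes per step. Note that -n / --n-predict only caps the total number of tokens generated; it does not configure drafting, so a value like 10 would just cut the output short. The draft model must share the same tokenizer/vocabulary (ideally same family, e.g., both Llama-3). This can give a 1.5-2x speedup on token generation, depending on how often the drafted tokens are accepted; prompt processing is not accelerated.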
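A minimal sketch of assembling such a command line, for readers who launch llama.cpp from a script. The binary name (llama-speculative), the model file names, and the helper build_speculative_cmd are assumptions for illustration and are not part of the report; flag spellings can also differ between llama.cpp versions, so check the binary's --help output before relying on them.

```python
import shlex


def build_speculative_cmd(main_model, draft_model, draft_tokens=16,
                          n_predict=256, binary="llama-speculative"):
    """Return an argv list for a speculative-decoding run.

    `binary` and the model paths are placeholders (assumptions), not
    values taken from the report.
    """
    if draft_tokens < 1:
        raise ValueError("draft_tokens must be at least 1")
    return [
        binary,
        "-m", main_model,               # large main (target) model
        "--model-draft", draft_model,   # small draft model, same vocabulary
        "--draft", str(draft_tokens),   # tokens drafted per step
        "-n", str(n_predict),           # total tokens to generate
    ]


cmd = build_speculative_cmd("main.gguf", "draft.gguf")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run without shell quoting concerns; shlex.join is used only to print a copy-pasteable line.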
Journey Context:
Most users only run single-model inference and don't realize llama.cpp supports speculative decoding via the --model-draft flag. The common mistake is trying to use a draft model with a different tokenizer (e.g., Llama-2 draft for Llama-3 main), which crashes or produces garbage. The --draft flag controls how many tokens the draft model generates per lookahead step; 8-16 is usually optimal. (-n / --n-predict is unrelated: it sets the total generation length.) This is distinct from continuous batching or parallel sequences: it specifically accelerates single-sequence generation.
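To make the lookahead step concrete, here is a toy sketch of greedy speculative decoding in plain Python. It is not llama.cpp's implementation: the two "models" are hypothetical next-token functions over small integers, and the names speculative_generate, greedy_generate, target_next, and draft_next are invented for this illustration. It shows why the output matches what the main model would have produced alone, while needing fewer main-model passes when the draft is mostly right.

```python
def greedy_generate(target_next, prompt, n_tokens):
    """Baseline: one target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]


def speculative_generate(target_next, draft_next, prompt, n_tokens, k):
    """Greedy speculative decoding with k drafted tokens per step.

    Returns (generated tokens, number of target verification passes).
    """
    seq = list(prompt)
    goal = len(prompt) + n_tokens
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target model checks every proposed position. A real
        #    engine does this in one batched forward pass, so it is
        #    counted as a single pass here.
        passes += 1
        for tok in proposal:
            expected = target_next(seq)
            seq.append(expected)      # always keep the target's token
            if tok != expected:
                break                 # first mismatch ends the step
        else:
            # All k drafts accepted: the same pass yields one extra token.
            seq.append(target_next(seq))
    return seq[len(prompt):goal], passes


# Toy models: the target counts upward modulo 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, passes = speculative_generate(target_next, draft_next, [0], 20, k=3)
print(tokens == greedy_generate(target_next, [0], 20), passes)
```

A larger k only helps while the draft keeps agreeing with the target; every rejected draft token is wasted draft work, which is consistent with a moderate setting such as 8-16 being a reasonable starting point.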
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:48:26.607203+00:00— report_created — created