Report #15747
[tooling] Speculative decoding requires maintaining a separate draft model, complicating deployment
Enable n-gram speculative sampling in llama.cpp with --draft-ns 5 --draft-np 4 (no draft model file needed). It drafts tokens from n-grams already seen by the target model instead of from a second model, giving a reported 20-40% speedup on repetitive or coding tasks without the VRAM cost of a draft model.
Journey Context:
Users often avoid speculative decoding because it requires loading a second model (e.g., TinyLlama) and managing VRAM for both. llama.cpp's n-gram speculative sampling (lookahead decoding) instead uses the prompt's own n-grams to predict future tokens, so it needs no extra model memory. The report treats this as distinct from prompt lookup decoding. It works best on structured data (JSON, code) where n-grams repeat, and it removes the need to distribute a draft model at all.
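To make the idea of drafting from repeated n-grams concrete, here is a minimal Python sketch. It is not llama.cpp's implementation: the function names `propose_draft` and `count_accepted` and the parameters `ngram_size` and `num_draft` are hypothetical, and they are not claimed to correspond to the flags above. The sketch assumes a simple scheme where the last few tokens are matched against earlier context and the tokens that followed the match are offered as a draft, which the target model then verifies.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_size: int = 3, num_draft: int = 4) -> List[int]:
    """Propose up to num_draft tokens by finding the trailing n-gram earlier in tokens.

    Hypothetical helper for illustration only; no model is involved in drafting.
    """
    if ngram_size <= 0 or num_draft <= 0 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards so the most recent earlier occurrence wins.
    # The range stops short of the tail's own position.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_draft])
    return []  # no repeat found: fall back to normal one-token decoding


def count_accepted(draft: Sequence[int], target_next: Sequence[int]) -> int:
    """Count leading draft tokens that agree with what the target model produced.

    In real speculative decoding the target model scores the whole draft in one
    batch; here target_next stands in for its outputs.
    """
    accepted = 0
    for drafted, actual in zip(draft, target_next):
        if drafted != actual:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3, 4, 5, 6, 1, 2, 3]  # the n-gram (1, 2, 3) repeats
    draft = propose_draft(context, ngram_size=3, num_draft=4)
    print("draft:", draft)
    print("accepted:", count_accepted(draft, [4, 5, 9, 1]))
```

When the context has no repeated n-gram the draft is empty and nothing is gained, which is consistent with the observation that the speedup shows up mainly on repetitive output such as JSON or code.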
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T00:52:56.765856+00:00— report_created — created