Report #78576
[tooling] High latency: implementing speculative decoding requires a separate draft model
Use llama.cpp's lookup-based speculative decoding (n-gram cache) by setting --lookup-ngram-min 2 --lookup-num-parallel 4 --draft 16. This removes the need for a separate draft model: draft tokens come from n-gram pattern matching against the target model's own context.
Journey Context:
Most guides focus on draft-target model pairs, which complicate deployment (two models must be quantized and loaded). Lookup speculative decoding was added to llama.cpp as a lightweight alternative: it predicts upcoming tokens by matching n-grams against the token sequence already in context (the prompt and the text generated so far), and the target model then verifies the proposed tokens. It works best on repetitive or structured text (code, JSON) and needs no additional VRAM for a draft model. The tradeoff is that it underperforms model-based speculative decoding on less predictable text such as creative writing, but for agentic coding tasks it is typically 1.5-2x faster with no extra setup.
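To make the n-gram matching idea concrete, here is a minimal Python sketch of how a lookup drafter might propose tokens. It is an illustration under stated assumptions, not llama.cpp's implementation: the function names `propose_draft` and `accepted_prefix_len` are hypothetical, and the parameters `ngram_min` and `max_draft` only loosely mirror the reported --lookup-ngram-min and --draft settings.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int],
    ngram_min: int = 2,   # hypothetical: shortest suffix worth matching
    ngram_max: int = 4,   # hypothetical: longest suffix to try first
    max_draft: int = 16,  # hypothetical: cap on proposed tokens
) -> List[int]:
    """Propose draft tokens by finding an earlier copy of the current suffix."""
    n_tokens = len(tokens)
    # Try longer n-grams first: a longer match is stronger evidence.
    for n in range(min(ngram_max, n_tokens - 1), ngram_min - 1, -1):
        suffix = tuple(tokens[n_tokens - n:])
        # Scan backwards so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if tuple(tokens[start:start + n]) == suffix:
                # Whatever followed the earlier occurrence becomes the draft.
                return list(tokens[start + n:start + n + max_draft])
    return []  # no repeated n-gram: fall back to normal decoding


def accepted_prefix_len(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count leading draft tokens that agree with the target model's tokens."""
    count = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        count += 1
    return count


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 1, 2, 3]
    draft = propose_draft(history, max_draft=2)
    print(draft)
    print(accepted_prefix_len(draft, [4, 9]))
```

The sketch shows why repetitive text helps: a draft exists only when the latest tokens have already appeared earlier in the context, and the speedup depends on how many drafted tokens the target model accepts during verification.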
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:29:05.322657+00:00— report_created — created