Report #12064
[tooling] Speculative decoding with llama.cpp -md draft.gguf gives no speedup or slower generation than base model
Add --draft-p-min 0.75 so that low-confidence draft tokens are filtered out and fewer verification cycles are wasted:

./llama-main -m target.gguf -md draft.gguf --draft 16 --draft-p-min 0.75 -ngl 99

Tune the threshold between 0.75 and 0.9 according to draft model size; smaller drafts (1B-2B) need higher thresholds (0.85+) to avoid high rejection rates.
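To compare thresholds across the suggested range, the command above can be generated for several values and each run timed by hand. The following is a minimal sketch: build_command is a hypothetical helper, the binary and file names are copied from the command in this report, and nothing is executed. It only prints the command lines.

```python
import shlex

def build_command(p_min, binary="./llama-main", target="target.gguf",
                  draft="draft.gguf", n_draft=16, gpu_layers=99):
    """Return the argv list for one run (hypothetical helper; flags as quoted in this report)."""
    return [
        binary,
        "-m", target,                     # target (large) model
        "-md", draft,                     # draft (small) model
        "--draft", str(n_draft),          # draft length, as in the report
        "--draft-p-min", f"{p_min:.2f}",  # confidence threshold under test
        "-ngl", str(gpu_layers),          # GPU layers to offload
    ]

# Print one command per threshold in the 0.75-0.9 range suggested above.
for p_min in (0.75, 0.80, 0.85, 0.90):
    print(shlex.join(build_command(p_min)))
```

Run each printed command on the same prompt and compare tokens per second against a run without -md to see which threshold, if any, gives a net gain on your hardware.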
Journey Context:
Standard speculative decoding accepts draft tokens sequentially until the first mismatch, then resamples from the target model. When draft confidence is low (p < 0.5), the target model will likely reject the token, and every token drafted after it is discarded along with it, so the work spent drafting and verifying that tail is wasted.

--draft-p-min sets a minimum probability threshold: when the draft model's top-token probability falls below this value, drafting stops at that point and generation falls back to the target model for that token instead of verifying a low-confidence guess. This matters most when a small Q4_0 quantized draft model (1B-2B parameters) is paired with a much larger target (70B), because such small models produce high-entropy (low-confidence) distributions on difficult tokens.

Without the threshold, verification overhead dominates and generation runs at roughly 0.8x the speed of the base model alone; with it, the reported speedup is 1.5-2.5x.
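The effect of the threshold can be illustrated with a toy model. This is a sketch under stated assumptions, not llama.cpp's implementation: draft_with_p_min and verify are hypothetical names, verification is modeled as greedy exact matching, and the tokens and probabilities are invented for illustration.

```python
def draft_with_p_min(draft_steps, p_min, n_max):
    """Collect draft tokens until top-token confidence drops below p_min or n_max is reached."""
    drafted = []
    for token, top_prob in draft_steps[:n_max]:
        if top_prob < p_min:
            break  # low confidence: stop drafting and let the target take over
        drafted.append(token)
    return drafted

def verify(drafted, target_tokens):
    """Count the leading draft tokens that match the target's own choices."""
    accepted = 0
    for d, t in zip(drafted, target_tokens):
        if d != t:
            break  # first mismatch: everything after it is discarded
        accepted += 1
    return accepted

# Invented example: the draft is unsure about its third token (p = 0.40).
steps = [("The", 0.95), ("cat", 0.90), ("sat", 0.40), ("on", 0.92)]
target = ["The", "cat", "lay", "on"]

for p_min in (0.0, 0.75):
    drafted = draft_with_p_min(steps, p_min, n_max=16)
    accepted = verify(drafted, target)
    print(f"p_min={p_min}: drafted={len(drafted)} accepted={accepted} "
          f"wasted={len(drafted) - accepted}")
```

With no threshold, four tokens are drafted and two are thrown away after the mismatch; with p_min at 0.75, drafting stops before the uncertain token and nothing is wasted. Real acceptance depends on the sampling settings and the model pair, so treat this only as intuition for why the flag can help.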
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:56:18.936912+00:00— report_created — created