Report #77618
[tooling] Speculative decoding in llama.cpp shows no speedup or high draft rejection rates
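Use a draft model from the same family as the target (e.g., a Llama-3-8B Q4_K_M draft for a Llama-3-70B target) so that both share a tokenizer and vocabulary, and set `--draft-ns 16 --draft-np 12` to balance parallel draft batching against accept-rate decay. The flag names are given as reported; mainline llama.cpp builds expose draft length as `--draft-max`/`--draft-min` and parallel sequences as `-np`, so confirm the spelling with `--help` on your build.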
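The balance between draft length and accept-rate decay can be sketched with the standard expected-speedup model for speculative decoding. This is a rough illustration, not a measurement of llama.cpp: it assumes every drafted token is accepted independently with probability `alpha`, and that one draft-model step costs `cost_ratio` times one target-model pass. The function names and both parameters are hypothetical and should be replaced with values measured on your own hardware.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    alpha: assumed per-token acceptance probability (0..1).
    gamma: number of tokens drafted before each verification.
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus the one token the target adds.
        return gamma + 1.0
    # Geometric series 1 + alpha + alpha^2 + ... + alpha^gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding, measured in target-pass units.

    cost_ratio: assumed cost of one draft step relative to one target pass.
    """
    cost_per_round = gamma * cost_ratio + 1.0  # gamma draft steps + 1 verify
    return expected_tokens_per_pass(alpha, gamma) / cost_per_round


def best_draft_length(alpha: float, cost_ratio: float, max_gamma: int = 32) -> int:
    """Draft length in 1..max_gamma that maximizes the modeled speedup."""
    return max(
        range(1, max_gamma + 1),
        key=lambda g: expected_speedup(alpha, g, cost_ratio),
    )


if __name__ == "__main__":
    # Illustrative inputs only: a well-matched draft vs. a poorly matched one.
    for alpha in (0.9, 0.5):
        g = best_draft_length(alpha, cost_ratio=0.05)
        print(alpha, g, round(expected_speedup(alpha, g, 0.05), 2))
```

Under this model a long draft only pays off when the acceptance rate is high; with a poorly matched draft model the best draft length collapses to a few tokens, and the modeled speedup drops below 1 when almost nothing is accepted.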
Journey Context:
Users often pick an unrelated small model (e.g., Phi-2) to draft for Llama-70B. The vocabularies do not match, so nearly every drafted token is rejected. Even with matching vocabularies, a single-token lookahead (`--draft-ns 1`) does not amortize the cost of a target verification pass. Raising `--draft-ns` to 16-24 tokens lets the GPU verify the whole draft in one batch, while `--draft-np` (parallel streams) keeps the draft model saturated. The tradeoff is VRAM: each parallel stream consumes additional KV cache, so 12 streams is reported as the sweet spot for 24GB cards.
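The VRAM tradeoff can be estimated before launching anything. The sketch below is a back-of-the-envelope calculation, not llama.cpp's actual allocator: it assumes each parallel stream holds its own full-context KV cache in 16-bit precision, and it uses the published Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) as the draft model. The function names and the 4 GiB budget are hypothetical.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_elem: int = 2,  # assumption: f16 cache, 2 bytes per element
) -> int:
    """Approximate KV cache size for one stream.

    The leading factor of 2 covers the separate key and value tensors.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


def streams_that_fit(budget_bytes: int, per_stream_bytes: int) -> int:
    """How many per-stream caches fit in the VRAM left after model weights."""
    return budget_bytes // per_stream_bytes


if __name__ == "__main__":
    MIB = 1024 ** 2
    GIB = 1024 ** 3
    # Llama-3-8B shape with a 4096-token context per stream.
    per_stream = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=4096)
    print(per_stream // MIB, "MiB per stream")
    # Hypothetical budget: 4 GiB left for draft KV cache.
    print(streams_that_fit(4 * GIB, per_stream), "streams fit")
```

If the build shares one cache across sequences, or the cache is quantized, the real footprint will be smaller than this estimate, so treat the result as an upper bound to check against observed VRAM usage.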
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:52:43.211509+00:00— report_created — created