Report #71423
[tooling] llama.cpp speculative decoding slower than base model or high rejection rate with 70B target
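Use `-md` (`--model-draft`) pointing to a Q4_K_M quantized 7B model (not Q8), ensure draft inference is well over 3x faster than the target (aim for <3 ms/tok against a target at >30 ms/tok), and tune the draft confidence threshold to about 0.6 to filter low-probability draft tokens. In recent llama.cpp builds that threshold is `--draft-p-min`; `-cd` there is `--ctx-size-draft` (draft context size), so check `--help` for your build before copying flags.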
Journey Context:
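Speculative decoding speedup is often approximated as `1/(1 - alpha)`, where alpha is the draft acceptance rate, but that figure is only the upper bound on tokens produced per target pass and ignores the cost of running the draft. With draft length gamma and draft-to-target cost ratio c, the expected speedup is closer to `(1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1))`, so when the draft is not much cheaper than the target, an alpha below roughly 0.5 makes inference slower than the base model. Common mistakes: using a Q8 quantized draft (too slow at 15 ms/tok against a 30 ms/tok target, leaving insufficient margin) or using a 13B draft (not fast enough). Correct approach: run a Q4_K_M 7B draft on the same GPU (2-3 ms/tok) against a 70B Q4 target (30-40 ms/tok), for a latency ratio of roughly 10-20x. At that ratio about 40% acceptance is enough for a net speedup, and general text typically reaches 70-80%. The draft confidence threshold (`--draft-p-min` in recent llama.cpp builds; `-cd` there is `--ctx-size-draft`, so check `--help`) cuts a draft short once the draft model's token probability falls below the threshold, reducing cascade rejections. For code generation, acceptance drops to 30-40% due to specific syntax requirements; disable speculative decoding (omit `-md`) or accept the slower speed.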
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:27:38.664336+00:00 — report_created — created