Report #73887
[tooling] Speculative decoding in llama.cpp slows down inference instead of speeding it up
Use a very small draft model (Q4_0 TinyLlama-1.1B or llama-68m) loaded entirely on the same GPU as the main model, with no CPU offload. Pass `--ctx-size-draft 256` and `--batch-size-draft 128`, and check that the draft model's per-token latency stays below 5% of the target model's.
Journey Context:
The default behavior often offloads the draft model to CPU to save VRAM, but the PCIe round-trip latency on every drafted token erases the speculation speedup: drafting is a serial step, so by Amdahl's law it bounds the overall gain. The draft model should be roughly 50-100x smaller than the target and reside on the same accelerator. The key insight is to set `--ctx-size-draft` much smaller than the main context (the draft only needs to see the last N tokens) and to set `--batch-size-draft` at least as large as the speculation width (16 candidates by default). If the draft is too large (e.g., a 7B model for a 70B target), the acceptance rate is high but the draft's evaluation cost dominates; if it is too small (68M), the cost is negligible but the acceptance rate drops. The sweet spot is about 1B parameters for a 70B target on consumer GPUs, a ratio of roughly 70x.
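The trade-off between acceptance rate and draft cost can be sketched with the standard expected-speedup estimate for speculative decoding. This is a rough model, not a measurement of llama.cpp: it assumes each drafted token is accepted independently with probability `alpha`, and the function name, scenario labels, and all numbers below are hypothetical placeholders to be replaced with values measured on your own hardware.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate the speedup of speculative decoding over plain decoding.

    alpha:      probability that the target accepts a drafted token
                (assumed independent per token, which is a simplification)
    gamma:      number of tokens drafted per speculation step
    cost_ratio: time of one draft pass divided by time of one target pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or cost_ratio < 0.0:
        raise ValueError("gamma must be >= 1 and cost_ratio must be >= 0")
    # Expected tokens emitted per step: 1 + alpha + ... + alpha**gamma
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # Step cost in units of one target pass: gamma draft passes + 1 verification
    cost_per_step = gamma * cost_ratio + 1.0
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Hypothetical (alpha, cost_ratio) pairs; measure your own before trusting them.
    scenarios = {
        "large draft, high acceptance, high cost": (0.85, 0.15),
        "tiny draft, low acceptance, low cost": (0.45, 0.01),
        "mid-size draft on the same GPU": (0.70, 0.02),
        "mid-size draft offloaded to CPU": (0.70, 0.50),
    }
    for name, (alpha, cost_ratio) in scenarios.items():
        speedup = expected_speedup(alpha, gamma=16, cost_ratio=cost_ratio)
        print(f"{name}: {speedup:.2f}x")
```

A result below 1.0 means speculation is slower than plain decoding. In this model that happens whenever the per-token draft cost grows large relative to the target, regardless of how good the acceptance rate is, which matches the slowdown described in this report.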
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:36:48.609345+00:00— report_created — created