Report #100218
[tooling] Speculative decoding in llama.cpp gives no speedup or throws draft-model errors
Use the llama-speculative-simple example with a much smaller draft model of the same architecture (-md), greedy sampling with top-k = 1 (--sampling-seq k --top-k 1 --temp 0.0), flash attention (-fa), and GPU offload for both models (-ngl 99 -ngld 99). Tune --draft-max, --draft-min, and --draft-p-min rather than running a full-size draft.
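The flags above can be collected into a single invocation. The following is a minimal sketch that only assembles and prints the command line; it does not run anything. The model and prompt file names are hypothetical placeholders, the -m and -f flags (target model and prompt file) are not mentioned in this report and are assumed from llama.cpp's common options, and the --draft-* values are illustrative starting points to tune rather than recommendations. Depending on the llama.cpp build, -fa may expect a value, so check --help before running.

```python
import shlex


def build_speculative_cmd(target_model, draft_model, prompt_file,
                          draft_max=16, draft_min=5, draft_p_min=0.9):
    """Assemble an argv list for llama-speculative-simple.

    All paths are hypothetical; the --draft-* defaults here are
    illustrative values to tune, not verified recommendations.
    """
    return [
        "llama-speculative-simple",
        "-m", target_model,            # large target model (assumed flag, not in the report)
        "-md", draft_model,            # much smaller draft model, same architecture
        "-f", prompt_file,             # prompt file (assumed flag, not in the report)
        "-ngl", "99", "-ngld", "99",   # offload target and draft layers to the GPU
        "-fa",                         # flash attention, as written in the report
        "--sampling-seq", "k", "--top-k", "1", "--temp", "0.0",  # greedy sampling
        "--draft-max", str(draft_max),
        "--draft-min", str(draft_min),
        "--draft-p-min", str(draft_p_min),
    ]


# Hypothetical file names, for illustration only.
cmd = build_speculative_cmd("target-32b.gguf", "draft-1.5b.gguf", "prompt.txt")
print(shlex.join(cmd))
```

Printing the command first makes it easy to review each flag against the installed binary's --help output before actually launching it.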
Journey Context:
Speculative decoding only pays off when the draft model is much faster than the target and usually agrees with it. Common mistakes are using a draft with a non-matching architecture, keeping the draft on the CPU, or letting sampling randomness break the verification assumption. The reference example pairs a 1.5B draft with a 32B target and prints the acceptance rate in its output, so you can verify that drafting is actually working.
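To see why both draft speed and agreement with the target matter, here is a rough back-of-the-envelope sketch. It uses the standard idealized model of speculative decoding, which assumes each drafted token is accepted independently with probability alpha; that is a simplification and not how llama.cpp computes or reports its acceptance rate. The function name and the example numbers are hypothetical.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Idealized speedup estimate for speculative decoding.

    alpha      -- assumed per-token acceptance probability, 0 <= alpha < 1
    gamma      -- number of tokens drafted per verification step
    cost_ratio -- time of one draft pass divided by time of one target pass
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per target verification pass.
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step in units of a target pass: gamma draft passes plus one verify.
    cost_per_step = gamma * cost_ratio + 1
    return tokens_per_step / cost_per_step


# Illustrative inputs only; none of these are measurements.
for label, alpha, cost in [
    ("small, well-aligned draft", 0.8, 0.05),
    ("full-size draft", 0.8, 1.0),
    ("small, poorly aligned draft", 0.2, 0.05),
]:
    print(f"{label}: {expected_speedup(alpha, 8, cost):.2f}x")
```

Under these assumptions only the first case comes out above 1x: a draft as slow as the target spends more time drafting than it saves, and a fast draft that is rarely accepted wastes most of its work.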
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:51:10.316226+00:00— report_created — created