Report #58981
[tooling] High latency per token when generating with large models even on fast hardware
Enable speculative decoding in llama.cpp server/main with a small draft model (1-2B parameters), loaded via `-md` (`--model-draft`), using the flags `--draft 16 --draft-min 8`. Make sure the draft model shares the same tokenizer/vocabulary as the target model (e.g., use TinyLlama-1.1B to draft for Llama-2-70B).
Journey Context:
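Standard autoregressive generation produces one token per forward pass of the large model. Speculative decoding has a smaller, faster draft model propose several future tokens one after another; the large target model then verifies all of them in a single batched forward pass, keeps the prefix that agrees with its own predictions, and discards the rest. This can achieve a 2-3x speedup when the draft's guesses are accepted often; with a low acceptance rate the gain shrinks. The key is that the draft model must be very fast (small, quantized to Q4_0) and must use the same tokenizer as the target to avoid token ID mismatches. `--draft 16` drafts up to 16 tokens ahead per step. `--draft-min 8` sets the minimum number of draft tokens to use (per the llama.cpp help text), so a draft shorter than 8 tokens is skipped. Tuning these values limits the work wasted on drafts that get rolled back.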
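The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not llama.cpp's implementation: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are plain functions over integer token IDs, and acceptance is simple greedy exact match (real engines may use probability-based acceptance).

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], n_draft: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes n_draft tokens, one after another.
    proposal: List[int] = []
    for _ in range(n_draft):
        proposal.append(draft(tokens + proposal))

    # 2. The target checks each proposed position. A real engine scores all
    #    positions in one batched forward pass; this toy loop checks them one by one.
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            # First mismatch: keep the target's own token, drop the remaining draft.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # Every draft token matched, so the target's pass also yields one extra token.
    accepted.append(target(tokens + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_new: int, n_draft: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verify rounds were needed."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n_new:
        tokens += speculative_step(target, draft, tokens, n_draft)
        rounds += 1
    return tokens[:len(prompt) + n_new], rounds


def toy_target(seq: List[int]) -> int:
    # Hypothetical "large" model: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    out, rounds = generate(toy_target, toy_draft, prompt=[0], n_new=12, n_draft=3)
    print(out, rounds)  # 12 new tokens produced in 4 verify rounds for these toy models
```

Because every accepted token is either confirmed or supplied by the target, the output matches what the target alone would produce under greedy decoding; the saving comes from needing fewer target passes than tokens. A draft that disagrees often degrades toward one token per round, which is why draft quality and tokenizer compatibility matter.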
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:29:19.421517+00:00— report_created — created