Report #42310
[tooling] llama.cpp generation is bottlenecked on tokens/sec; want a 2-3x speedup without quantizing the main model further
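Use speculative decoding: load a small draft model (e.g., a Q4_0 7B) with `-md draft.gguf --draft-max 5`, where `--draft-max` (also accepted as `--draft`; check `--help` for your build) sets the number of tokens drafted per iteration. Note that `-ngld` is not the draft length: it is the short form of `--gpu-layers-draft` and sets how many draft-model layers are offloaded to the GPU.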
Journey Context:
Standard autoregressive generation decodes one token per forward pass. Speculative decoding uses a small, fast 'draft' model to guess the next N tokens, then the large 'main' model verifies them all in one parallel forward pass. When the acceptance rate is high, you get up to N tokens for roughly the cost of one main-model pass plus N cheap draft passes. The catch is that the draft model must share the main model's tokenizer/vocabulary (the tokenizer metadata in both .gguf files must be compatible) and should be significantly faster, which in practice means several times smaller. The draft length (`--draft-max`, not `-ngld`) controls speculation depth: too high wastes compute on rejected tokens, too low underutilizes the mechanism. This is orthogonal to quantization: it is a latency reduction technique that leaves the main model's weights untouched.
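The depth trade-off described above can be estimated with a small back-of-the-envelope model. This is a minimal sketch, not anything llama.cpp computes: it assumes each drafted token is accepted independently with probability `alpha`, and that one draft pass costs `cost_ratio` times one main-model pass. The function name and all parameter names are hypothetical, and the sample values are illustrative inputs rather than measurements.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup of speculative decoding over one-token-per-pass decoding.

    alpha      -- assumed per-token acceptance probability (independent per token)
    gamma      -- tokens drafted per iteration (the draft length)
    cost_ratio -- assumed cost of one draft pass relative to one main-model pass
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0 or cost_ratio < 0:
        raise ValueError("gamma and cost_ratio must be non-negative")

    # Expected tokens produced per iteration: the accepted prefix of the draft
    # plus the one token the main model always contributes itself.
    if alpha == 1.0:
        tokens_per_iteration = gamma + 1
    else:
        tokens_per_iteration = (1 - alpha ** (gamma + 1)) / (1 - alpha)

    # Cost per iteration, in units of one main-model pass:
    # gamma draft passes plus a single verification pass.
    cost_per_iteration = gamma * cost_ratio + 1
    return tokens_per_iteration / cost_per_iteration


if __name__ == "__main__":
    # Illustrative inputs only: sweep the draft length for two acceptance rates.
    for alpha in (0.6, 0.8):
        for gamma in (2, 4, 8, 16):
            s = expected_speedup(alpha, gamma, cost_ratio=0.1)
            print(f"alpha={alpha} gamma={gamma:2d} speedup={s:.2f}x")
```

Under these assumptions the speedup rises with draft length up to a point and then falls again, because the extra draft passes cost more than the rarely accepted tail tokens return. Measured acceptance rates on your own prompts are the only reliable guide to a good draft length.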
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:29:25.502491+00:00— report_created — created