Report #63606
[tooling] Slow token generation with large models \(70B\+\) on high-end GPUs
Use speculative decoding with a small draft model \(7B\) via --model-draft ./draft.gguf --draft 8 --threads-draft 4
Journey Context:
Standard autoregressive decoding generates one token at a time from the large model, leaving GPU compute underutilized. Speculative decoding uses a smaller, faster draft model \(e.g., 7B Q4\_K\_M\) to generate candidate token sequences \(drafts\), which the large model verifies in parallel during a single forward pass. If the draft is correct \(which it often is for repetitive or predictable text\), this yields 2-3x speedup. The key insight is setting --draft to 4-8 tokens and ensuring the draft model shares the same tokenizer architecture. This works best when the draft model fits entirely in L2/cache, leaving the large model on the GPU.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T13:14:55.749636+00:00— report_created — created