Report #13511
[tooling] Slow token generation on large models (70B+) even with full GPU offloading
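Use speculative decoding with a quantized draft model: keep the main model on -m and pass the draft model with -md small.gguf (e.g., 7B Q4_0), plus --draft 16 to set how many tokens are drafted per step. In recent builds --draft-n appears to be an alias of --draft, so passing both is redundant; flag names have changed across llama.cpp versions, so confirm them with --help. Ensure the small model fits in leftover VRAM on the same GPU and uses a vocabulary compatible with the main model.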
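As a rough illustration of the invocation described above, the following Python sketch only assembles the argument list and does not run anything. The binary name llama-speculative, the file name large-70b.gguf, and the -md, -ngl, and -ngld flags are assumptions that are not stated in this report; check them against --help for your build.

```python
import shlex


def build_speculative_cmd(main_model, draft_model, n_draft=16,
                          binary="llama-speculative", gpu_layers=99):
    """Return an argv list for a speculative-decoding run (nothing is executed)."""
    return [
        binary,                    # assumed binary name; depends on the build
        "-m", main_model,          # large target model
        "-md", draft_model,        # small draft model (assumed flag, --model-draft)
        "--draft", str(n_draft),   # tokens drafted per step, as in the report
        "-ngl", str(gpu_layers),   # offload target-model layers to the GPU
        "-ngld", str(gpu_layers),  # offload draft-model layers (assumed flag)
    ]


# "large-70b.gguf" is a hypothetical file name used only for illustration.
cmd = build_speculative_cmd("large-70b.gguf", "small.gguf")
print(shlex.join(cmd))
```

Building the command as a list keeps the main and draft model paths clearly separated, which is the detail most likely to go wrong when typing the flags by hand.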
Journey Context:
Users often think speculative decoding requires a separate machine or a complex Python setup. In llama.cpp, simply point to a smaller GGUF (it can be aggressively quantized to Q4_0) in the same CLI invocation as the main model. The overhead is negligible if the draft model fits in spare VRAM, and the speedup is typically around 1.5-2x, depending on how often the main model accepts the drafted tokens. The documentation for this is buried in the example READMEs.
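To make the speedup claim more concrete, here is a back-of-the-envelope estimate based on the commonly used analysis of speculative decoding. It assumes each drafted token is accepted independently with probability alpha, that gamma tokens are drafted per step, and that one draft-model pass costs cost_ratio of a main-model pass. These parameter names and the sample values are illustrative assumptions, not measurements from this report, and llama.cpp's actual behavior may differ.

```python
def expected_speedup(alpha, gamma, cost_ratio):
    """Estimate speedup over plain decoding under an i.i.d. acceptance model.

    alpha:      assumed per-token acceptance probability (0 <= alpha < 1)
    gamma:      number of tokens drafted per step
    cost_ratio: draft-pass cost divided by main-model-pass cost
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per main-model verification pass.
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step, in units of one main-model pass.
    step_cost = gamma * cost_ratio + 1
    return tokens_per_step / step_cost


# Illustrative values only: a draft roughly 10x cheaper than the main model.
for alpha in (0.5, 0.7, 0.8, 0.9):
    print(f"alpha={alpha}: ~{expected_speedup(alpha, 16, 0.1):.2f}x")
```

Under this model a low acceptance rate can make speculative decoding slower than plain decoding, so it is worth measuring tokens per second with and without the draft model.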
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T18:53:40.612584+00:00— report_created — created