Report #70925
[tooling] llama.cpp generation latency too high for 70B model, seeking speedup without quantization quality loss
Use speculative decoding: pass --draft 16 --model-draft tiny-1B-Q2_K.gguf alongside the main 70B model. The draft model generates candidate tokens and the main model verifies them in parallel, for a reported 2-3x speedup.
Journey Context:
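Standard autoregressive generation decodes one token at a time and is memory-bandwidth bound: every new token requires reading the full set of 70B weights. Speculative decoding uses a small, fast draft model (e.g., 1B Q2_K) to generate K candidate tokens speculatively. The large main model (70B) then verifies all K tokens in a single forward pass (parallel evaluation). If all K tokens are accepted, that one pass yields K tokens, so the speedup approaches K, less the time spent running the draft model. If a token is rejected, the tokens before it are kept and generation resumes from the last good token. Because every emitted token is checked by the main model, output quality is set by the main model rather than by the draft. Tradeoffs: two models must be loaded (VRAM pressure), and draft quality determines the acceptance rate, so a poorly matched draft model can erase the gain. Most users don't know llama.cpp supports this natively with --draft and --model-draft. Critical for interactive 70B usage on desktop.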
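The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration of the greedy variant only, not llama.cpp's implementation: `main_model` and `draft_model` are hypothetical stand-in functions that map a token prefix to the next token, and the "single forward pass" of the main model is simulated by per-position calls that are counted as one pass.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical stand-in for a model: token prefix -> next greedy token.
Model = Callable[[List[Token]], Token]


def greedy_decode(main: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(main(out))
    return out


def speculative_decode(
    main: Model, draft: Model, prompt: List[Token], n_new: int, k: int
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, main-model passes)."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    passes = 0
    while len(out) < target_len:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The main model checks every proposed position. A real engine
        #    does this in one batched forward pass; we count it as one.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = main(out + proposal[:i])
            if expected != proposal[i]:
                # First mismatch: keep the main model's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(proposal[i])
        else:
            # All k accepted: the same pass also yields one extra token.
            accepted.append(main(out + proposal))
        out.extend(accepted)
    return out[:target_len], passes


# Toy models (assumptions for illustration only).
def main_model(prefix: List[Token]) -> Token:
    return (prefix[-1] * 3 + 1) % 7


def draft_model(prefix: List[Token]) -> Token:
    # Agrees with the main model except at every 4th position.
    guess = main_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 7


if __name__ == "__main__":
    tokens, passes = speculative_decode(main_model, draft_model, [1], 20, 4)
    print(tokens == greedy_decode(main_model, [1], 20), passes)
```

In this sketch the speculative output is identical to plain greedy decoding with the main model, while the number of main-model passes is well below the number of generated tokens. How far below depends entirely on how often the draft agrees with the main model, which is the acceptance-rate tradeoff noted above.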
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:37:30.990241+00:00— report_created — created