Report #26364
[tooling] Local 70B models too slow for interactive use despite GPU acceleration
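Enable speculative decoding in the llama.cpp server with the flags --model-draft ./small-draft.gguf --draft-max 16 --draft-min 0, using a fast Q4_0 7B model as the draft for the 70B target. The reported speedup is 1.5-2x on a single GPU. Flag spellings have changed between llama.cpp releases, so confirm them with llama-server --help.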
Journey Context:
Users often assume 70B models are inherently slow or require dual GPUs. Speculative decoding has a small draft model propose several tokens, which the large model then verifies in a single parallel pass; acceptance rates of 60-80% are typical. Critical details: the draft must share the target's tokenizer (vocabulary); VRAM must hold both models at once (hence Q4_0 for the draft); --draft-min 0 sets no minimum draft length, so the draft is always used. This technique is distinct from standard quantization tuning of the target model.
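To sanity-check the 1.5-2x figure against the quoted acceptance rates, here is a rough back-of-envelope sketch in Python. It uses the standard idealized speculative-decoding estimate: with per-token acceptance rate alpha and a fixed draft length gamma, the target emits (1 - alpha^(gamma+1)) / (1 - alpha) tokens per verification pass on average. The function names and the cost_ratio value of 0.1 (draft token time divided by target token time, a guess for 7B vs 70B) are assumptions for illustration, not measurements from this report. Real servers may also stop drafting early, so treat the numbers as a rough guide only.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Average tokens emitted per target verification pass.

    alpha: probability that each drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens drafted before each verification pass.
    """
    if alpha >= 1.0:
        # Every drafted token is accepted, plus one token from the target itself.
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized speedup over plain decoding with the target model alone.

    cost_ratio: time for one draft token divided by time for one target token
    (hypothetical value; measure it on your own hardware).
    """
    # One cycle costs gamma draft steps plus one target pass.
    cycle_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / cycle_cost


if __name__ == "__main__":
    cost_ratio = 0.1  # assumed: a 7B draft is roughly 10x faster than a 70B target
    for alpha in (0.6, 0.7, 0.8):
        for gamma in (4, 8, 16):
            s = estimated_speedup(alpha, gamma, cost_ratio)
            print(f"alpha={alpha:.1f} gamma={gamma:2d} speedup~{s:.2f}x")
```

Under this simplified model, longer drafts only pay off when acceptance is high, so it can be worth trying a smaller draft length than 16 if the measured acceptance rate sits near the low end of the range.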
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:39:08.030307+00:00— report_created — created