Report #8558
[tooling] High latency per token when running large models (70B+) locally, even on fast hardware
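Use llama.cpp's speculative decoding with a smaller draft model from the same family (e.g., a 7B Q4_K_M that shares the target's tokenizer/vocabulary): --draft 5 --model main-70b.gguf --model-draft draft-7b.gguf. Note that the draft-model flag is --model-draft (short form -md), not --draft-model, and flag names have changed across llama.cpp versions, so confirm with --help on your build. Reported speedups are around 1.5-2x, because the large model verifies several drafted tokens in a single pass instead of generating them one by one.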
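Whether the 1.5-2x figure is reachable depends mostly on how often the draft's tokens are accepted. A rough back-of-envelope estimate, assuming each drafted token is accepted independently with probability alpha and that one draft pass costs a fixed fraction of one target pass (both are simplifying assumptions, and the 0.1 cost ratio in the usage note is a hypothetical value, not a measurement):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model pass when k tokens are drafted.

    Assumes each drafted token is accepted independently with probability
    alpha; the target always contributes one token of its own (a correction
    or a bonus token), so the result lies between 1 and k + 1.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    cost_ratio is the (assumed) cost of one draft pass relative to one
    target pass; each round costs one target pass plus k draft passes.
    """
    return expected_tokens_per_pass(alpha, k) / (1.0 + k * cost_ratio)
```

For example, estimated_speedup(0.7, k=5, cost_ratio=0.1) lands just under 2x under these assumptions, while a low acceptance rate or an expensive draft model can push the estimate below 1x, i.e. a slowdown.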
Journey Context:
Autoregressive generation produces one token per forward pass, which creates a severe latency bottleneck for large models where each pass is expensive. Speculative decoding uses a small, fast 'draft' model to predict the next K tokens, then the large 'target' model verifies all K tokens in a single batched pass. Tokens are accepted up to the first mismatch, so when the draft is correct (common for easy tokens) you get up to K+1 tokens for the cost of one large-model pass plus K small passes. Users often miss two things. First, the draft can be the same base model at a lower quant (e.g., a Q2_K draft for a Q5_K_M target) rather than a completely different model, although the gain depends on the draft being substantially faster than the target. Second, the --draft parameter controls the lookahead window, i.e. how many tokens are drafted per round.
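The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration, not llama.cpp's implementation: the "models" are hypothetical toy functions that map a token list to a next token, verification is exact-match greedy (sampling-based setups use a probabilistic acceptance rule instead), and the per-position target calls stand in for what a real engine does in one batched pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: token sequence -> greedy next token.
NextToken = Callable[[List[int]], int]


def toy_target(ctx: List[int]) -> int:
    """Toy 'large' model: an arbitrary deterministic rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except at every 4th position."""
    guess = toy_target(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one round and return the tokens it commits (between 1 and k + 1)."""
    # 1. The draft model proposes k tokens autoregressively (cheap passes).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All k proposals matched, so the target also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int) -> List[int]:
    """Generate n_tokens with repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def greedy(model: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(model(out))
    return out[len(prompt):]
```

With greedy verification the output is identical to decoding with the target alone; the draft only changes how many target passes are needed, which is why a poor draft costs speed but not quality.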
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T05:46:53.592760+00:00— report_created — created