Report #9533
[tooling] llama.cpp inference latency too high for interactive use despite GPU acceleration
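Use speculative decoding with a smaller draft model from the same model family, so that draft and target share a tokenizer and vocabulary: pass --model large.gguf --model-draft small.gguf --draft 5 (e.g., a 7B draft for a 70B target). In upstream llama.cpp this is implemented by the separate speculative example program (built as llama-speculative in newer releases) rather than by main, and flag names have changed between versions, so confirm them with --help on your build. Reported latency reductions are roughly 30-50% when the token-acceptance rate is above 0.7, without the quality loss that comes with aggressive quantization.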
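As a rough sanity check on the latency figures, the standard speculative-decoding analysis gives the expected number of tokens emitted per target-model pass as (1 - a^(g+1)) / (1 - a), where a is the acceptance rate and g the number of drafted tokens. The sketch below is an illustration only, not a measurement: it assumes every drafted token is accepted independently with the same probability, and the cost_ratio parameter (time for one draft token divided by time for one target pass) is a hypothetical value you would need to measure on your own hardware.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target verification pass.

    alpha: per-token acceptance probability (assumed independent and constant).
    gamma: number of tokens the draft model proposes per round (the --draft value).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target itself.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio: (time per draft token) / (time per target pass). This is an
    assumed input; it is not reported anywhere in the original text.
    """
    round_cost = gamma * cost_ratio + 1.0  # gamma draft steps + one target pass
    return expected_tokens_per_pass(alpha, gamma) / round_cost


if __name__ == "__main__":
    for alpha in (0.7, 0.8):
        s = expected_speedup(alpha, gamma=5, cost_ratio=0.2)
        print(f"alpha={alpha}: speedup {s:.2f}x, latency reduction {1 - 1 / s:.0%}")
```

Under these assumptions, a lower acceptance rate or a more expensive draft model shrinks the gain quickly, which is why the choice of draft model matters more than the exact --draft value.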
Journey Context:
Users trying to speed up local inference often default to aggressive quantization (Q4_0), which hurts quality, or buy faster hardware. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023; not to be confused with Medusa, which adds extra decoding heads to the target model instead of using a separate draft model) uses a small draft model to propose several tokens ahead, which the large model then verifies together in a single batched forward pass. The trick is using a draft from the same family (e.g., Llama-2-7B for Llama-2-70B): a shared tokenizer and similar training data keep acceptance rates high (~80%). Alternatives such as lookahead decoding are available in llama.cpp only as a separate example program, not as an option on the usual inference path. The --draft flag is underused because users assume they need a purpose-built 'draft' architecture; a smaller same-family model works well.
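The draft-then-verify loop described above can be sketched in a few lines. This is a toy, greedy-only illustration and not llama.cpp's implementation: the two "models" are hypothetical stand-in functions that map a token prefix to the next token, and the verification step is written as a plain loop where a real engine would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]  # prefix -> greedy next token


def speculative_greedy(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, n_draft: int = 5) -> Tuple[List[int], float]:
    """Generate n_new tokens; return (tokens, fraction of drafted tokens accepted)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(out) < goal:
        # 1. The cheap draft model proposes n_draft tokens autoregressively.
        guess: List[int] = []
        for _ in range(n_draft):
            guess.append(draft(out + guess))
        proposed += len(guess)
        # 2. The target checks each drafted position in order.
        for i, tok in enumerate(guess):
            want = target(out + guess[:i])
            if want != tok:
                # First mismatch: keep the agreed prefix, then the target's token.
                out.extend(guess[:i])
                out.append(want)
                accepted += i
                break
        else:
            # All drafted tokens accepted; the target contributes one extra token.
            accepted += len(guess)
            out.extend(guess)
            out.append(target(out))
    rate = accepted / proposed if proposed else 0.0
    return out[:goal], rate


# Hypothetical toy models: the draft agrees with the target except on some inputs.
def toy_target(prefix: Sequence[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: Sequence[int]) -> int:
    return 0 if prefix[-1] % 4 == 0 else toy_target(prefix)


if __name__ == "__main__":
    tokens, rate = speculative_greedy(toy_target, toy_draft, [1], n_new=20)
    print(tokens, f"acceptance rate: {rate:.2f}")
```

Because every emitted token is either confirmed or supplied by the target, the output of this greedy sketch is identical to what the target alone would produce; only the number of target passes changes.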
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:23:27.253817+00:00— report_created — created