Report #38559
[tooling] llama.cpp slow inference on large models (70B+) despite GPU utilization
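Use speculative decoding with a small draft model: run with `-md ./draft.gguf --draft 8`, where the draft is a 7B or smaller model sharing the same tokenizer (e.g., Llama-2-7B drafting for Llama-2-70B). The draft model is passed with `-md` / `--model-draft`; `--draft`, `--draft-n`, and `--draft-max` all set the same value (the number of tokens drafted per step), so pass only one of them. Flag names have changed between llama.cpp releases, so confirm them with `--help` for your build. This can cut per-token latency by roughly 1.5-2.5x on memory-bandwidth-bound systems, depending on how often the large model accepts the drafted tokens.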
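To sanity-check the quoted speedup range for a given setup, the standard back-of-envelope estimate for speculative decoding can be sketched as below. This is a rough model, not a measurement: `alpha` (the per-token acceptance rate), `k` (tokens drafted per step), and `c` (cost of one draft step relative to one target step) are all assumed inputs, and `c=0.1` is only a guess for a 7B draft against a 70B target.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Estimated latency speedup from speculative decoding.

    alpha: probability the target accepts a drafted token (assumed i.i.d.)
    k:     number of tokens drafted per verification step
    c:     cost of one draft-model step relative to one target-model step
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens emitted per target pass: the accepted prefix of the
    # draft plus the one token the target always contributes itself.
    tokens_per_pass = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)
    # Each pass costs k draft steps plus one target step.
    cost_per_pass = k * c + 1.0
    return tokens_per_pass / cost_per_pass


if __name__ == "__main__":
    # Hypothetical acceptance rates; measure your own from the run statistics.
    for alpha in (0.6, 0.7, 0.8):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, k=8, c=0.1):.2f}x")
```

If the estimate comes out below 1.0 for a realistic acceptance rate, drafting fewer tokens per step or using a smaller draft model is likely the better trade.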
Journey Context:
For a single request, decoding with a large model is memory-bandwidth-bound, not compute-bound: every generated token requires streaming the full set of weights, so standard batching doesn't help single-request latency. Speculative decoding uses a cheap small model to draft several tokens, then the large model verifies them all in one parallel pass, accepting the drafted tokens up to the first one it rejects. Common failures: using a draft model with a different tokenizer (causes crashes), or using a draft model that is too large (diminishing returns, because drafting itself becomes expensive). Alternatives like prompt lookup decoding (PLD) exist but are prompt-dependent; the 7B/70B pairing is the robust sweet spot for quality vs speed.
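The draft-then-verify loop described above can be illustrated with a toy sketch. It assumes greedy decoding and uses stand-in "models" (plain functions over integer tokens); `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names for illustration and not part of llama.cpp, which also batches the verification into one forward pass rather than looping.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real engine scores all k
    #    positions in a single batched pass; this toy loops instead.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted: the target adds one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


def toy_target(ctx: List[int]) -> int:
    # Stand-in for the large model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Stand-in for the small model: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under greedy decoding the result is identical to running the target alone; the draft only changes how many target passes are needed, which is why a poorly matched draft costs speed rather than quality.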
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:12:01.210043+00:00— report_created — created