Report #58038
[tooling] High latency per token when running large models \(70B\+\) on local hardware
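Use speculative decoding with a smaller draft model: run a \`llama.cpp\` binary that supports it \(the \`llama-speculative\` example or \`llama-server\`\) with \`-md\` pointing to a small draft GGUF \(e.g., a Q4\_0 7B\) and set \`--draft\` \(number of tokens drafted per step\) to 4-8. Note that \`-ngld\` sets how many draft-model layers are offloaded to the GPU, not the draft length. Flag names vary between versions, so check \`--help\` on your build. Reported latency reductions are around 30-50%, because the target model verifies several drafted tokens in one parallel pass.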
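As a rough illustration of the invocation described above, here is a minimal Python sketch that assembles the command line. The model paths, prompt, and layer counts are hypothetical placeholders, and the binary name \`llama-speculative\` and the \`--draft\` flag are assumptions about a recent llama.cpp build; older builds may name them differently.

```python
import shlex

# Hypothetical file names; substitute your own GGUF files.
TARGET_MODEL = "models/target-70b.gguf"
DRAFT_MODEL = "models/draft-7b-q4_0.gguf"


def build_command(n_draft: int = 6) -> list[str]:
    """Assemble an argument list for a speculative-decoding run."""
    return [
        "llama-speculative",      # assumed binary name; verify on your build
        "-m", TARGET_MODEL,       # large target model
        "-md", DRAFT_MODEL,       # small draft model
        "--draft", str(n_draft),  # tokens drafted per step (4-8 suggested above)
        "-ngl", "99",             # GPU layers for the target (placeholder value)
        "-ngld", "99",            # GPU layers for the draft (placeholder value)
        "-p", "Explain speculative decoding in one paragraph.",
        "-n", "256",              # number of tokens to generate
    ]


cmd = build_command()
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the arguments as a list avoids shell-quoting problems with the prompt, and makes it easy to sweep the draft length to find the best value for a given model pair.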
Journey Context:
Autoregressive generation produces one token at a time, and each step is bound by the memory bandwidth needed to read the full model's weights. Speculative decoding uses a cheap draft model to predict several future tokens, then the large target model verifies them in a single parallel pass and keeps the prefix it agrees with. If the draft is accurate \(which it often is for natural language\), speedups are substantial; if it is often wrong, the drafting work is wasted and the gain shrinks. The tradeoff is loading two models into VRAM. Users often miss the \`-md\` flag or pair mismatched models \(the draft must share the target's tokenizer and vocabulary, which in practice means the same model family\). A 7B draft with a 70B target is a common pairing for local inference.
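The draft-then-verify loop described above can be sketched with toy models. This is a simplified greedy variant written for illustration, not llama.cpp's implementation: the two "models" are hypothetical arithmetic functions, the bonus token that real engines emit after a fully accepted draft is omitted, and verification is counted as one target pass even though the toy calls the target once per position.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function: context -> token id.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (new tokens, target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens, one at a time (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real engine does
        #    this in one batched forward pass, so we count it as one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if tok != expected:
                break  # first mismatch: keep the target's token, drop the rest
        out.extend(accepted)
    return out[len(prompt):], passes


def target(ctx: List[int]) -> int:
    """Hypothetical toy target model."""
    return (ctx[-1] * 3 + 1) % 50


def draft(ctx: List[int]) -> int:
    """Hypothetical toy draft: agrees with the target unless the last token
    is a multiple of 5."""
    nxt = target(ctx)
    return nxt if ctx[-1] % 5 else (nxt + 1) % 50


# Baseline: plain autoregressive decoding with the target only.
baseline: List[int] = []
ctx = [1]
for _ in range(20):
    ctx = ctx + [target(ctx)]
    baseline.append(ctx[-1])

tokens, passes = speculative_decode(target, draft, [1], n_new=20, k=4)
print("same output as target-only decoding:", tokens == baseline)
print("target passes:", passes, "vs", len(baseline), "without a draft")
```

Because every emitted token is either confirmed or replaced by the target, the output matches target-only greedy decoding; the saving comes from needing fewer target passes when the draft is usually right.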
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:54:19.647296+00:00— report_created — created