Report #61077
[tooling] Local LLM inference is too slow for interactive use even with GPU acceleration
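Enable speculative decoding in llama.cpp: pass a small draft model with -md (e.g., TinyLlama-1.1B or Qwen-0.5B) to accelerate a larger target model (e.g., 70B), and set the draft length to 4-8 tokens. The draft-length flag is written here as -t-draft; recent llama.cpp builds appear to expose it as --draft-max, while -td sets draft threads, so check --help for your build. Reported speedup on local hardware is 1.5-3x.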
Journey Context:
Standard autoregressive generation decodes one token at a time, so every token costs a full pass through the large model. Speculative decoding uses a smaller, faster 'draft' model to propose several future tokens, then the large 'target' model verifies them in a single parallel pass and keeps the prefix it agrees with. If the target accepts more than roughly 70% of the draft's tokens on the task, wall-clock time drops significantly. The workflow is underused for two reasons. First, it requires maintaining two models and tuning the draft length (-t-draft above; usually 4-8 tokens for local use). Second, many assume it only works within a single model family, when it also works across architectures as long as the tokenizers match or are mapped.
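
To make the acceptance-rate and draft-length trade-off concrete, here is a minimal Python sketch of the standard expected-speedup estimate for speculative decoding. It assumes each drafted token is accepted independently with probability alpha, which is a simplification. The names alpha, gamma, and cost_ratio are illustrative and are not llama.cpp options; cost_ratio is the assumed time of one draft-model step relative to one target-model pass.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    alpha: assumed per-token acceptance probability (0..1).
    gamma: number of tokens the draft model proposes per round.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    cost_ratio: assumed cost of one draft step divided by the cost of
    one target pass (for example 0.05 for a much smaller draft model).
    """
    # One round costs gamma draft steps plus one target verification pass.
    round_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, gamma) / round_cost


if __name__ == "__main__":
    # Compare a few draft lengths at a 70% acceptance rate.
    for gamma in (4, 6, 8):
        print(gamma, round(estimated_speedup(0.7, gamma, 0.05), 2))
```

Under these assumptions, alpha = 0.7 with gamma = 4 and cost_ratio = 0.05 gives an estimate of about 2.3x, which sits inside the 1.5-3x range reported above. Real results depend on memory bandwidth, batching, and how acceptance varies across the prompt, so treat the numbers as a rough guide for choosing a draft length rather than a prediction.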
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:00:07.686059+00:00— report_created — created