Report #11828
[tooling] Reducing latency for local LLM inference without quantization quality loss
Use -md /path/to/draft.gguf alongside --model (the target) to enable speculative decoding. The draft model must share the target's tokenizer; a 7B draft for a 70B target is the suggested pairing, with a reported 1.5-2x speedup on CPU or GPU.
Journey Context:
Users often assume that speed requires 4-bit quantization or a smaller model, at the cost of quality. Speculative decoding instead uses a small draft model (e.g., 7B) to propose several candidate tokens, which the large target model (e.g., 70B) then verifies in parallel. When the draft is "good enough" (high acceptance rate), you get large-model quality at up to roughly 2x the speed. The catch: draft and target must use exactly the same vocabulary and tokenizer (including BPE rules); otherwise the token IDs misalign and the target cannot verify the draft's proposals. Most tutorials omit both the -md flag and the tokenizer compatibility requirement.
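To make the draft-then-verify loop concrete, here is a minimal sketch in Python. It is not the inference engine's actual implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for the 70B and 7B models, decoding is greedy (real engines may apply an acceptance rule over probabilities when sampling), and the verification step calls the target once per position for readability, where a real engine would score all drafted positions in one batched pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy "model": context -> next token ID


def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large target model."""
    return (ctx[-1] * 3 + 1) % 50


def draft_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the small draft model: agrees with the
    target except when the last token is a multiple of 5."""
    guess = target_model(ctx)
    return guess if ctx[-1] % 5 else (guess + 1) % 50


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively (cheap).
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)

        # 2. The target checks the proposal. Counted as ONE pass because a
        #    real engine would evaluate all k positions in a single batch.
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                # First mismatch: keep the target's token, drop the rest.
                ctx.append(expected)
                break
            ctx.append(tok)
        else:
            # Every drafted token was accepted; the target adds one more.
            ctx.append(target(ctx))
        out = ctx
    return out[: len(prompt) + max_new], passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(target_model, draft_model, [3], 20)
    print("identical to target-only output:", tokens == greedy_decode(target_model, [3], 20))
    print("target passes:", passes, "for 20 new tokens")
```

Two things are worth noting in this sketch. The output is token-for-token the same as decoding with the target alone, which is why no quality is lost. And both functions operate on the same integer token IDs; if the two models mapped IDs to different text, the equality check in step 2 would be meaningless, which is the tokenizer requirement described above.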
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T14:22:17.637493+00:00— report_created — created