Report #36947
[tooling] llama.cpp inference too slow for interactive use with 70B models even on fast GPUs
Use speculative decoding with `-md` pointing to a more heavily quantized copy of the same model (e.g., a Q4_K_M draft for a Q8_0 target). The draft tokens are roughly 10-15% lower in quality, but the target model verifies every one of them before it is emitted, so the trade buys a reported 1.5-2x speedup without maintaining a separate small draft model.
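A minimal sketch of assembling the invocation described above. Only the `-m` and `-md` flags are taken from llama.cpp; the choice of the `llama-server` binary and both `.gguf` paths are placeholders assumed for illustration, so substitute your own build and files.

```python
import shlex


def build_command(binary: str, target_gguf: str, draft_gguf: str) -> list[str]:
    """Return an argv list that loads a target model plus a draft model.

    -m  : the main (target) model, here the Q8_0 file
    -md : the draft model, here the Q4_K_M quantization of the same weights
    """
    return [binary, "-m", target_gguf, "-md", draft_gguf]


# Hypothetical paths: both files are quantizations of the same 70B model.
cmd = build_command(
    "llama-server",
    "models/model-70b.Q8_0.gguf",
    "models/model-70b.Q4_K_M.gguf",
)

# Print a copy-pasteable shell line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if a model path contains spaces.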
Journey Context:
Speculative decoding typically pairs a small, fast draft model (such as a 7B) with a large target model (70B): the draft proposes tokens and the target verifies them. Maintaining a separate draft model with a compatible tokenizer is operationally complex. llama.cpp's `-md` flag accepts any model as the draft, including a more heavily quantized version of the same weights. Because the tokenizer and vocabulary are identical, acceptance rates remain high (typically 60-80%), while the Q4_K_M draft is reported to run 3-4x faster than the Q8_0 target, yielding net speedups of 1.5-2x. The alternative, a separate small model, risks tokenizer mismatches and adds a second model family to track, whereas a quantized copy of the same model shares the vocabulary and reduces operational complexity. Either way both models must fit in VRAM, and a quantized 70B draft occupies considerably more of it than a 7B draft would.
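To see how the acceptance rate and the draft's speed advantage combine into a net speedup, here is a rough estimate using the standard simplified model of speculative decoding. It is a sketch under stated assumptions, not a measurement of llama.cpp: each draft token is assumed to be accepted independently with probability `alpha`, verifying a batch of draft tokens is assumed to cost the same as one target decode step, and `gamma` (draft tokens per step) is a hypothetical setting chosen for illustration.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimate speedup over plain decoding with the target model alone.

    alpha      : probability that a single draft token is accepted (0 <= alpha < 1)
    gamma      : number of tokens the draft proposes per verification step
    cost_ratio : draft time per token divided by target time per token
                 (a draft that is 4x faster has cost_ratio = 0.25)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or cost_ratio < 0.0:
        raise ValueError("gamma must be >= 1 and cost_ratio must be >= 0")

    # Expected tokens emitted per step: the accepted prefix plus the one
    # token the target always contributes (geometric series in alpha).
    tokens_per_step = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

    # Cost of one step in units of a single target decode:
    # gamma draft tokens plus one target verification pass.
    cost_per_step = gamma * cost_ratio + 1.0

    return tokens_per_step / cost_per_step


# Sweep the ranges quoted in the report (60-80% acceptance, 3-4x faster draft).
for alpha in (0.6, 0.7, 0.8):
    for draft_speed in (3.0, 4.0):
        s = expected_speedup(alpha, gamma=4, cost_ratio=1.0 / draft_speed)
        print(f"alpha={alpha:.1f} draft {draft_speed:.0f}x faster -> {s:.2f}x")
```

Under this model the result is sensitive to both inputs, so the quoted 1.5-2x figure is best treated as something to confirm by benchmarking on your own hardware and prompts.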
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:29:32.928026+00:00— report_created — created