Report #22379
[tooling] llama.cpp CPU inference is too slow for 70B models, making interactive use impossible
Use -md path/to/draft.gguf to load a smaller draft model (e.g., 1B-3B parameters) alongside the main 70B model. This enables speculative decoding, in which the 70B model verifies several drafted tokens (typically 2-4) per forward pass, and can accelerate generation by roughly 2-3x on CPU.
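A minimal sketch of assembling such an invocation, for readers who launch llama.cpp from a script. Only the -md flag comes from this report. The binary name llama-speculative, the helper name build_speculative_command, and the model paths are assumptions for illustration; -m, -p and -n are llama.cpp's usual model, prompt and token-count options, but check --help on your build before running.

```python
import shlex


def build_speculative_command(main_model, draft_model, prompt,
                              n_predict=128, binary="llama-speculative"):
    """Return an argv list that loads a main model plus a draft model.

    `binary` is an assumption: use whichever llama.cpp executable in your
    build accepts -md (verify with --help).
    """
    return [
        binary,
        "-m", main_model,     # main (large) model
        "-md", draft_model,   # draft model, as described in this report
        "-p", prompt,         # prompt text
        "-n", str(n_predict), # number of tokens to generate
    ]


if __name__ == "__main__":
    cmd = build_speculative_command(
        "path/to/main-70b.gguf",  # hypothetical path
        "path/to/draft.gguf",
        "Explain speculative decoding in one sentence.",
    )
    # Print a copy-pasteable shell line instead of executing anything.
    print(shlex.join(cmd))
```

The helper only builds and prints the command, so it can be inspected before anything is run, in line with the warning on this report.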
Journey Context:
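Large models are memory-bandwidth bound on CPU: generating each token means streaming the full set of weights from RAM, so every forward pass is slow. Speculative decoding uses a small, fast draft model to propose several candidate tokens, which the large model then verifies together in one batched forward pass. Because the 70B weights are read once per pass no matter how many candidates are checked, each accepted candidate costs little extra. If the draft model achieves a 70-80% acceptance rate (common with good draft models), each 70B pass yields several tokens on average instead of one, and effective tokens per second rise by about that factor, less the time spent running the draft model. The draft model is cheap because at 1B-3B parameters it moves a small fraction of the bytes per token that the 70B model does; it generally does not fit in L2/L3 cache on typical CPUs, so it is still read from RAM. Tradeoff: you must maintain a compatible draft model (same tokenizer, ideally a smaller model from the same family), and RAM usage increases because two models are loaded. In return, 70B CPU inference can move from unusable (<1 tok/s) to interactive (roughly 2-3 tok/s or better).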
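The relationship between acceptance rate and speedup can be estimated with a back-of-the-envelope model. The sketch below uses the standard simplification that each drafted token is accepted independently with the same probability, and that verifying a short batch costs about the same as one normal 70B pass. The function names and the draft_cost_ratio value of 0.05 are illustrative assumptions, not measurements from this report.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per large-model pass.

    Assumes independent acceptance with probability `accept_rate` for each
    of `draft_len` drafted tokens; the large model always contributes one
    token itself (the correction or the bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    `draft_cost_ratio` is the assumed time of one draft-model pass divided
    by the time of one large-model pass.
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    cost = 1.0 + draft_len * draft_cost_ratio  # one verify pass + drafts
    return tokens / cost


if __name__ == "__main__":
    for rate in (0.7, 0.8):
        speedup = estimated_speedup(rate, draft_len=4, draft_cost_ratio=0.05)
        print(f"accept_rate={rate:.1f} -> estimated speedup {speedup:.2f}x")
```

Under these assumptions, acceptance rates of 0.7 and 0.8 with four drafted tokens give estimates of about 2.3x and 2.8x, which is in line with the 2-3x range quoted above. Real results depend on how well the draft model matches the main model and on the actual cost of batched verification on your hardware.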
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T15:58:09.864628+00:00— report_created — created