Report #9713
[tooling] CPU inference of 70B models is too slow; how to get 2-3x speedup without GPU?
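Use llama.cpp speculative decoding: load a small draft model (roughly 100M to 1B params, e.g., TinyLlama-1.1B or a custom slim transformer) on CPU alongside the main 70B model. Run with `--model-draft <path-to-draft.gguf> --draft N`, with N around 5-7 (flag names may differ between llama.cpp versions, so check `--help`). The small model proposes the next N tokens; the large model verifies them in a single batched pass, typically accepting 3-4 tokens per pass when the draft agrees with the main model most of the time.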
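To sanity-check the "3-4 tokens per pass" figure, here is a rough back-of-envelope sketch in Python. It uses the standard expected-length formula for speculative decoding and assumes each drafted token is accepted independently with the same probability. The names `accept_rate`, `draft_len`, and `cost_ratio` are illustrative, not llama.cpp parameters, and the speedup estimate assumes that verifying a short batch costs about the same as one normal 70B step, which may not hold on every CPU.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per large-model verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplification).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every drafted token is accepted, plus one token from the large model.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def expected_speedup(accept_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio is (time per draft-model step) / (time per large-model step).
    Assumes one verification pass costs the same as one large-model step.
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    time_per_pass = draft_len * cost_ratio + 1.0
    return tokens / time_per_pass


if __name__ == "__main__":
    for rate in (0.6, 0.7, 0.8):
        print(rate, expected_tokens_per_pass(rate, 5), expected_speedup(rate, 5, 0.02))
```

Under these assumptions an acceptance rate near 0.8 with 5 drafted tokens lands in the 3-4 tokens per pass range; lower acceptance rates shrink the gain quickly, so measure on your own prompts.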
Journey Context:
Standard CPU token generation is memory-bandwidth bound: every new token requires streaming the full set of weights from RAM, and the autoregressive dependency forces this to happen one token at a time. Speculative decoding relaxes that serial bottleneck by having a cheap draft model guess several tokens ahead; the large model then evaluates the guesses in a single batched pass, so one sweep over the weights can yield several tokens. Many assume both models need a GPU or identical architectures; in practice, CPU draft + CPU main can work well because the draft is tiny and cache-friendly. Tradeoff: it requires maintaining or training a compatible draft model (the draft and main model must share a tokenizer and vocabulary). Alternatives like prompt caching help with prefix reuse but not with generation speed.
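The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of the greedy variant only, not llama.cpp's implementation: `target_next` and `draft_next` are made-up stand-ins for the 70B and draft models, and the verification loop is written sequentially for clarity, whereas a real engine scores all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    """Toy stand-in for the large model's greedy next token."""
    return (sum(seq) * 31 + len(seq)) % 50


def draft_next(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except at every 4th position."""
    guess = target_next(seq)
    return (guess + 1) % 50 if len(seq) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, draft_len: int = 5) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (sequence, verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1. The draft model proposes draft_len tokens autoregressively.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(seq + proposal))
        # 2. The target checks the proposal (conceptually one batched pass).
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first wrong guess
                break
            accepted.append(tok)
        else:
            # All guesses matched: the same pass also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    out, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
    print(out == greedy_decode(target_next, prompt, 20), passes)
```

The key property is that the output matches what the large model would have produced on its own; the draft only changes how many large-model passes are needed, which is why a weak draft costs speed but not quality.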
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T08:50:21.619853+00:00— report_created — created