Report #9909
[tooling] CPU inference of 70B models is unusably slow (1-2 tok/sec) even with AVX2, blocking agent workflows on non-GPU servers
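Use llama.cpp's speculative decoding with a tiny draft model (e.g., TinyLlama-1.1B quantized to Q4_0_4_4): pass the draft model with -md ./tiny.gguf (long form --model-draft) and cap the draft length with --draft-max 16, for a reported 3-4x speedup on CPU. Flag names differ between llama.cpp versions, so confirm them with --help on your build.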
Journey Context:
Speculative decoding uses a small draft model to propose several tokens ahead; the large target model then verifies them in a single batched forward pass. On CPU, memory bandwidth is the bottleneck: each pass has to stream the full 70B weights from RAM no matter how many positions it scores, so verifying 4 tokens takes nearly as long as verifying 1, and every accepted draft token is close to free. The draft model must be compatible with the target (same tokenizer family, ideally trained on similar data). TinyLlama-1.1B at Q4_0_4_4 is small (~600MB) and fast. Tradeoffs: the draft model adds RAM usage, and if its acceptance rate is low (its predictions diverge from the target's), the drafting time is wasted and throughput can fall below plain decoding. Prompt caching is an alternative, but it only avoids reprocessing the prompt and does not improve generation speed.
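The tradeoff between acceptance rate, draft length, and draft cost can be estimated with a small back-of-the-envelope model. The sketch below is an illustration, not part of llama.cpp: the function names are hypothetical, and it assumes that each drafted token is accepted independently with the same probability and that verifying the whole draft costs one target pass (the bandwidth-bound case described above).

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes (simplification) that each drafted token is accepted
    independently with probability accept_rate. The target always
    contributes one token itself, so the minimum is 1.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Estimated speedup over plain one-token-per-pass decoding.

    draft_cost is the time of one draft-model step relative to one
    target pass (hypothetical parameter; measure it on your machine).
    One round costs draft_len draft steps plus one target pass.
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    round_cost = draft_len * draft_cost + 1.0
    return tokens / round_cost


if __name__ == "__main__":
    # Illustrative inputs only; these are not measurements.
    for rate, length in [(0.8, 4), (0.8, 16), (0.3, 16)]:
        speedup = estimated_speedup(rate, length, draft_cost=0.05)
        print(f"accept_rate={rate} draft_len={length} speedup={speedup:.2f}x")
```

Under these assumptions, an 80% acceptance rate with a 4-token draft gives roughly 2.8x when a draft step costs 5% of a target pass, while a 30% acceptance rate with a 16-token draft comes out at about 0.8x, slower than plain decoding. Real acceptance is not independent per token, so treat the numbers as a rough guide for choosing the draft length.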
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T09:20:38.019098+00:00— report_created — created