Report #71883
[tooling] Slow token generation for large models on CPU/limited VRAM
Use speculative decoding with a tiny draft model (e.g., 1B Q4_0) via --draft 5 --draft-model draft.gguf to accelerate a large target model (70B), achieving 2-3x speedup even on CPU-only machines.
Journey Context:
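Standard generation runs one full forward pass of the 70B model per token. Speculative decoding has the small draft model propose the next 5 tokens, then the large model verifies all of them in a single batched forward pass and keeps the longest prefix it agrees with. If the first 3 of 5 are accepted, that one pass yields 3 tokens instead of 1, saving 2 full forward passes of the 70B model (the same pass also supplies the target's own token at the first mismatch, so the round still makes progress). This works even when the draft runs on CPU and the target on GPU, because the draft is small enough to stay cheap and, when generation is memory-bandwidth bound, verifying several tokens in one pass costs little more than generating one.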
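The draft-then-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation of any particular tool: `toy_target` and `toy_draft` are hypothetical stand-ins for the 70B and 1B models, decoding is greedy only (sampling needs a rejection-sampling acceptance rule instead of an equality check), and the per-position target calls stand in for what a real runtime does in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical model type: maps a token context to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int = 5) -> Tuple[List[int], int]:
    """One round of greedy speculative decoding.

    Returns (tokens produced this round, number of draft tokens accepted).
    """
    # 1. The cheap draft model proposes k tokens one at a time.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target scores every position. A real runtime does this in ONE
    #    batched forward pass; the loop here only simulates that.
    target_choice = [target(context + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest prefix of the draft that the target agrees with.
    accepted = 0
    while accepted < k and proposed[accepted] == target_choice[accepted]:
        accepted += 1

    # 4. Append the target's own token at the first mismatch (or after all k).
    return proposed[:accepted] + [target_choice[accepted]], accepted


def generate(target: Model, draft: Model, prompt: List[int],
             n_tokens: int, k: int = 5) -> Tuple[List[int], int]:
    """Generate n_tokens and count how many target passes were needed."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        new_tokens, _ = speculative_step(target, draft, out, k)
        out.extend(new_tokens)
        target_passes += 1  # one batched verification per round
    return out[len(prompt):len(prompt) + n_tokens], target_passes


def toy_target(ctx: List[int]) -> int:
    # Stand-in for the large model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Stand-in for the small model: same rule, but wrong after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0]))  # ([1, 2, 3, 4], 3)
    tokens, passes = generate(toy_target, toy_draft, [0], 12)
    print(tokens, passes)  # 12 tokens from 3 target passes
```

In the first round the draft gets 3 of 5 tokens right, so a single target pass yields 4 tokens, matching the 3/5 example. The output is identical to what the target alone would produce; only the number of expensive passes changes.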
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:14:34.451246+00:00— report_created — created