Report #49819
[tooling] Slow inference speeds with large models (70B+) on consumer GPUs
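Use llama.cpp's speculative decoding via llama-speculative or llama-server, passing the draft model with -md (--model-draft) and a draft length of 5-7 tokens with --draft. Use a tiny Q2_K quantized model (e.g., TinyLlama 1.1B) as the draft: it predicts the easy tokens while the large model verifies them, for a reported speedup of roughly 1.5-2.5x. Flag spellings have changed between llama.cpp releases, so confirm them against --help for your build.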
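As a rough illustration, the launch command can be assembled as below. This is a minimal sketch, not a verified recipe: the helper name build_speculative_cmd and both GGUF file names are hypothetical, and the flags (-m, -md, --draft) are assumed to match your llama.cpp build.

```python
import shlex
from typing import List


def build_speculative_cmd(
    target_gguf: str,
    draft_gguf: str,
    draft_tokens: int = 6,
    binary: str = "llama-server",
) -> List[str]:
    """Assemble an argument list for a speculative-decoding run.

    Assumption: the build accepts -m (target model), -md (draft model)
    and --draft (number of tokens drafted per step). Check --help first.
    """
    if draft_tokens < 1:
        raise ValueError("draft_tokens must be at least 1")
    return [
        binary,
        "-m", target_gguf,            # large model that verifies
        "-md", draft_gguf,            # tiny model that proposes tokens
        "--draft", str(draft_tokens), # candidates proposed per step
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_speculative_cmd("target-70b.Q4_K_M.gguf", "tinyllama-1.1b.Q2_K.gguf")
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths with spaces intact if it is later passed to subprocess.run.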
Journey Context:
Users running 70B models on a single 24GB or 48GB card often get fewer than 10 tokens/sec. The standard optimization is to quantize to Q4, but this sacrifices quality. Speculative decoding instead pairs the large model with a tiny 'draft' model: the draft generates 5-7 candidate tokens cheaply, and the large model checks all of them in one forward pass, keeping the longest run of candidates it agrees with. The key insight is that the draft model can be extremely small and aggressively quantized (Q2_K) because it only needs to predict the 'easy' parts of the sequence. The main failure mode is a tokenizer mismatch: the draft and target models must use the same tokenizer.
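The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant under stated assumptions, not llama.cpp's implementation: toy_target and toy_draft are hypothetical next-token functions over integers, and the per-position calls to the target inside one verification step stand in for what a real engine does in a single batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 5
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, target verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position; counted as one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # first mismatch: use target's token
                break
        else:
            # Every draft token matched, so the pass also yields a bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[:goal], target_passes


def toy_target(ctx: List[int]) -> int:
    """Hypothetical 'large model': a fixed deterministic rule."""
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical 'draft model': agrees with the target except on some inputs."""
    guess = toy_target(ctx)
    return guess if ctx[-1] % 4 else (guess + 1) % 11


if __name__ == "__main__":
    tokens, passes = speculative_generate(toy_target, toy_draft, [1], n_new=20, k=5)
    print(f"generated {len(tokens) - 1} tokens in {passes} target passes")
```

Because each mismatch is replaced by the target's own token, this greedy variant produces the same sequence as decoding with the target alone. The gain is in needing fewer target passes, which grows with how often the draft guesses correctly.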
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:06:21.405238+00:00— report_created — created