Report #17655
[tooling] Slow inference on 70B+ models even with GPU acceleration
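Use llama.cpp speculative decoding with a small Q4_0 7B model as the draft: ./speculative -m 70B-model.gguf -md 7B-model.gguf -c 4096 (recent llama.cpp builds name this binary llama-speculative). This can give roughly a 2x speedup: the 7B model drafts several tokens cheaply, and the 70B model verifies them together in a single batched pass instead of generating them one at a time.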
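To make the draft-and-verify idea concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not llama.cpp's implementation: target_model and draft_model are hypothetical stand-ins (simple arithmetic rules, not neural networks), and only the greedy case is shown. Real engines that sample use a probabilistic acceptance rule instead of an exact match.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a greedy model: maps a token sequence to its next token.
NextToken = Callable[[List[int]], int]


def target_model(seq: List[int]) -> int:
    """Hypothetical 'large' model: an arbitrary deterministic rule."""
    return (seq[-1] * 3 + 1) % 17


def draft_model(seq: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except when the
    sequence length is a multiple of 3, where it guesses 0."""
    return 0 if len(seq) % 3 == 0 else target_model(seq)


def speculative_generate(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int,
                         k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify loop.

    Returns the generated sequence and the number of target verification
    passes that were needed.
    """
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks every proposed position. A real engine does
        #    this in one batched forward pass; here it is counted as one pass.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            accepted.append(expected)  # equals guess when they agree
            if guess != expected:
                break  # first mismatch: keep the correction, drop the rest
        else:
            # Every guess matched, so the same pass yields one extra token.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens, target_passes
```

Calling speculative_generate(target_model, draft_model, [1], 20) returns the same tokens the target would produce on its own; only the number of target passes drops, which is where the speedup comes from.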
Journey Context:
Speculative decoding uses a small, fast model to predict several tokens ahead; the large model then verifies them in parallel. If the draft model's acceptance rate is high (>70%), inference speed increases significantly. The key is to pick a draft from the same architecture family (e.g., Llama-3 8B drafting for Llama-3 70B) so that both models share an identical tokenizer. Tradeoff: VRAM usage grows by the size of the draft model (~4GB for 7B Q4), and a low acceptance rate adds overhead, because rejected draft tokens are wasted work. Alternative: Medusa heads require training, whereas speculative decoding works with any existing small GGUF that is tokenizer-compatible with the target.
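The acceptance-rate tradeoff can be estimated with a back-of-the-envelope calculation. The sketch below uses the standard geometric-series estimate from the speculative decoding literature, assuming each drafted token is accepted independently with probability alpha. The function names and the cost_ratio parameter (time of one draft step divided by time of one target pass) are illustrative assumptions, and the model ignores any extra cost of verifying a larger batch.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass when k tokens are drafted
    and each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # all k accepted plus the target's bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain decoding.

    One round costs k draft steps plus one target pass, measured in units
    of a single target pass: k * cost_ratio + 1.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    # cost_ratio=0.1 is an illustrative guess, not a measurement.
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=4, cost_ratio=0.1), 2))
```

Under these assumptions, alpha = 0.7 with 4 drafted tokens and a draft that costs a tenth of the target comes out just under 2x, while an acceptance rate of 0 gives a value below 1, meaning the draft model only slows things down.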
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T05:55:52.267765+00:00 - report_created - created