Report #99246
[research] How do I speed up local or self-hosted LLM inference without changing outputs?
Use vLLM speculative decoding: start with zero-cost n-gram or prompt-lookup for repetitive text, then add a small draft model such as EAGLE or an MLP speculator for 1.5-3x speedup. Keep draft and target tokenizers aligned; vLLM does not support GGUF for speculative decoding. Tune num_speculative_tokens and watch acceptance rate.
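A minimal sketch of the n-gram starting point, assuming a recent vLLM release that accepts a speculative_config dict with the keys method, num_speculative_tokens and prompt_lookup_max. Older releases used different argument names, so check the docs for the installed version. The helper name ngram_spec_config and the model name are hypothetical placeholders, and the vLLM import is left commented out so the snippet runs without it.

```python
def ngram_spec_config(num_speculative_tokens: int = 5,
                      prompt_lookup_max: int = 4) -> dict:
    """Build a speculative_config dict for n-gram (prompt-lookup) drafting.

    Key names are an assumption based on recent vLLM documentation;
    verify them against the version you have installed.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be >= 1")
    return {
        "method": "ngram",                                # no draft model needed
        "num_speculative_tokens": num_speculative_tokens,  # draft length per step
        "prompt_lookup_max": prompt_lookup_max,            # longest n-gram to match
    }


engine_kwargs = {
    "model": "your-org/your-target-model",  # placeholder, not a real checkpoint
    "speculative_config": ngram_spec_config(),
}

# Hypothetical usage, uncomment once vLLM is installed:
# from vllm import LLM, SamplingParams
# llm = LLM(**engine_kwargs)
# outputs = llm.generate(["Repeat after me: ..."], SamplingParams(max_tokens=64))
```

Starting with a short draft length and raising it only while the acceptance rate stays high is usually the safer way to tune num_speculative_tokens.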
Journey Context:
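Speculative decoding verifies cheap draft tokens against the full model and keeps only the ones the full model agrees with, so the output distribution is preserved. It helps most when generation is memory-bound and the draft is cheap, because the target model can then check several draft tokens in one forward pass for about the cost of generating one. N-gram lookup is free but limited to text that repeats earlier context; EAGLE is the current quality and speed sweet spot. The draft model must fit in the same process as the target, so memory headroom matters.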
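The draft-and-verify loop described above can be sketched in plain Python. This is a toy under stated assumptions, not vLLM's implementation: it covers greedy decoding only, the "target model" is a hypothetical deterministic function (toy_target), and it calls the target once per draft token, whereas a real engine scores all draft positions in one batched forward pass. All function names here are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # stand-in for one greedy target step


def ngram_propose(context: List[Token], n: int, k: int) -> List[Token]:
    """Prompt-lookup draft: find the most recent earlier occurrence of the
    last n tokens and propose up to k tokens that followed it."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + k]
    return []


def plain_greedy(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, prompt: List[Token], max_new: int,
                       n: int = 2, k: int = 4) -> Tuple[List[Token], Dict[str, int]]:
    out = list(prompt)
    stats = {"proposed": 0, "accepted": 0, "target_passes": 0}
    while len(out) - len(prompt) < max_new:
        draft = ngram_propose(out, n, k)
        stats["proposed"] += len(draft)
        stats["target_passes"] += 1  # one verification pass per loop iteration
        new: List[Token] = []
        for tok in draft:
            if tok != target(out + new):  # first disagreement ends acceptance
                break
            new.append(tok)
            stats["accepted"] += 1
        # The target's own token: a correction after a mismatch,
        # or a bonus token when every draft token was accepted.
        new.append(target(out + new))
        out.extend(new)
    return out[:len(prompt) + max_new], stats


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target: a repeating 0..4 cycle, i.e. very repetitive text."""
    return (ctx[-1] + 1) % 5


prompt = [0, 1, 2, 3, 4, 0, 1]
tokens, stats = speculative_greedy(toy_target, prompt, max_new=20)
rate = stats["accepted"] / stats["proposed"] if stats["proposed"] else 0.0
print(tokens == plain_greedy(toy_target, prompt, 20), stats, rate)
```

Because every emitted token is either a draft token the target agreed with or the target's own choice, the result matches plain greedy decoding; only the number of target passes changes. The accepted/proposed ratio is the acceptance rate to watch when tuning the draft length.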
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:49:05.218507+00:00 - report_created - created