Report #88098
[tooling] Slow token generation on local LLMs despite GPU utilization
Enable speculative decoding by loading a small draft model (e.g., 1B-7B, Q4) alongside the main model: pass the draft model's path with `-md` and set the draft's GPU layer count with `-ngld`. Target a 2-3x speedup.
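A rough sketch of how these flags might be combined, written as a Python helper that only assembles an argument list and runs nothing. The binary name `llama-server`, the `-m` flag for the main model, the file names, and the layer count of 99 are all placeholders assumed for illustration; the report itself names only `-md`, `-ngld`, and `-ngl`.

```python
def build_command(binary, main_model, draft_model, main_gpu_layers, draft_gpu_layers):
    """Assemble an argv list that loads a draft model next to the main model."""
    return [
        binary,                          # hypothetical binary name, not given in the report
        "-m", main_model,                # assumed flag for the main model path
        "-ngl", str(main_gpu_layers),    # GPU layers for the main model
        "-md", draft_model,              # draft model path
        "-ngld", str(draft_gpu_layers),  # GPU layers for the draft model, set separately
    ]


# Placeholder paths and layer counts; substitute your own before running anything.
cmd = build_command("llama-server", "main-model.gguf", "draft-model-q4_0.gguf", 99, 99)
print(" ".join(cmd))
```

Check the flag names against the help output of whichever tool you are using before running the resulting command.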
Journey Context:
Standard decoding generates one token per forward pass of the main model. Speculative decoding has a small draft model propose the next K tokens, which the large model then verifies in a single parallel pass. If all K are accepted, you get K tokens (plus one more from the verification pass itself) for the cost of one large pass + K small passes; on a mismatch, only the tokens before the first rejection are kept. The hard-won insight is that the draft model can be tiny (1B params) and aggressively quantized (Q4_0): most tokens are easy to predict, and because the main model checks every one, output quality is still set by the main model. This can yield large speedups even on single-GPU setups, as long as the draft's guesses are accepted often enough. The `-md` flag sets the draft model path, and `-ngld` controls its GPU layers separately from the main model's `-ngl`.
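The propose-then-verify loop can be illustrated with a toy sketch. This is not how any particular runtime implements it: it assumes greedy decoding, and `draft_next` and `target_next` are hypothetical stand-ins for the two models (simple functions over integer tokens), chosen so the draft is wrong at exactly one point.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round of greedy speculative decoding on toy models.

    draft_next / target_next map a token list to the next token.
    Returns the tokens accepted this round (between 1 and k + 1).
    """
    # 1. The draft model proposes k tokens autoregressively (k cheap passes).
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(context + proposed))

    # 2. The target model checks each position. A real runtime scores all
    #    positions in one batched pass; here it is called per position for clarity.
    accepted = []
    for tok in proposed:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token at the first mismatch
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the verification also yields a bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(tokens):
    # Toy "large model": counts upward modulo 10.
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    # Toy "draft model": agrees with the target except right after a 2.
    return 9 if tokens[-1] == 2 else (tokens[-1] + 1) % 10


partial = speculative_step([0], draft_next, target_next, k=4)
perfect = speculative_step([0], target_next, target_next, k=4)
print(partial)  # [1, 2, 3]
print(perfect)  # [1, 2, 3, 4, 5]
```

In the first call the draft's third guess is rejected, so the round still yields three tokens; in the second every guess is accepted and the round yields K + 1 = 5. The practical speedup therefore depends on how often the draft agrees with the main model.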
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:27:32.685968+00:00— report_created — created