Report #54203
[tooling] Slow token generation speed in llama.cpp despite GPU acceleration and high batch sizes
Enable speculative decoding by adding `-md <draft-model.gguf> -ngld 999` to your command; `-md` takes the path to the draft model file. Use a small, fast model (e.g., 7B Q4_0) as the draft to predict tokens for a larger target model (e.g., 70B). The draft must share the target's tokenizer and vocabulary, so pick both models from the same family (Llama-2, Mistral, etc.) and run them with the same context length.
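As a rough illustration of the full command line, the sketch below assembles an argument list in Python. The binary name `llama-speculative`, the model paths, and the helper `build_speculative_command` are assumptions made for this example and are not taken from the report; check `--help` on your own build, since binary and flag names vary between llama.cpp versions.

```python
import shlex
from typing import List


def build_speculative_command(
    target_model: str,
    draft_model: str,
    prompt: str,
    binary: str = "llama-speculative",  # assumed binary name; adjust to your build
) -> List[str]:
    """Assemble an argv list for a speculative-decoding run (hypothetical helper)."""
    return [
        binary,
        "-m", target_model,   # large target model that verifies tokens
        "-md", draft_model,   # small draft model that proposes tokens
        "-ngl", "999",        # offload all target layers to the GPU
        "-ngld", "999",       # offload all draft layers to the GPU
        "-p", prompt,
    ]


# Hypothetical file names, for illustration only.
cmd = build_speculative_command(
    "models/target-70b.gguf",
    "models/draft-7b-q4_0.gguf",
    "Explain speculative decoding in one sentence.",
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd)` would launch the run; printing it first makes it easy to confirm that `-md` is followed by a model path.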
Journey Context:
Users often assume slow generation is purely a memory-bandwidth bottleneck and accept 10-20 t/s as the limit. Speculative decoding breaks this ceiling: a small draft model cheaply proposes the next K tokens one after another, and the large model verifies all K in a single forward pass, keeping the tokens that match its own predictions. The common error is using a draft model with a different tokenizer or architecture; its proposals then fail verification, and the run pays the drafting overhead without gaining speed. The `-ngld 999` flag offloads all draft layers to the GPU so the draft does not stall on CPU-GPU synchronization. The tradeoff is higher VRAM usage (two models are held in memory at once), but speedups of 2-3x are typical when the target accepts most of the draft's tokens.
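The draft-then-verify loop described above can be sketched with toy models. This is a minimal greedy illustration, not llama.cpp's implementation: `target`, `draft`, `speculative_step`, and `generate` are hypothetical names, the "models" are simple functions over integer tokens, and the verification is written as a loop where a real engine would use one batched forward pass.

```python
from typing import Callable, List, Tuple

# A toy "model": maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target(ctx: List[int]) -> int:
    """Toy target model: always counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. The draft proposes k tokens sequentially.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposed position (one batched pass in practice).
    emitted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            emitted.append(expected)  # the target's own token replaces the first mismatch
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the same pass also yields one extra token.
    emitted.append(target(ctx))
    return emitted


def generate(target: Model, draft: Model, prefix: List[int], n_new: int, k: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verification rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[: len(prefix) + n_new], rounds


tokens, rounds = generate(target, draft, [0], n_new=12, k=4)
print(tokens, rounds)
```

Each round emits at least one token chosen by the target, so the output matches plain greedy decoding with the target alone; the gain comes from needing fewer target passes when the draft's proposals are accepted.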
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:28:39.699732+00:00— report_created — created