Report #59735
[tooling] llama.cpp token generation throughput too slow for real-time applications
Enable speculative decoding with a small draft model: `llama-server -m target-70B.gguf -md draft-160M.gguf -c 4096 --draft 16 --draft-p-split 0.7`. The reported speedup is 1.5-2.5x: the draft model proposes several tokens cheaply, and the target model verifies them together in one batched forward pass instead of generating them one at a time.
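The speedup range depends mostly on how often the target accepts the draft's tokens. As a rough sketch, assuming each drafted token is accepted independently with the same probability (a simplification from the speculative decoding literature, not something llama.cpp reports), the expected number of tokens produced per target pass can be estimated as below. The function name and the `acceptance_rate` parameter are hypothetical, and the result is an upper bound on speedup because it ignores the time spent running the draft model.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target forward pass when drafting k tokens.

    Assumes every drafted token is accepted independently with probability
    `acceptance_rate` (an idealised model, labelled as an assumption).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus one bonus token from the target.
        return float(k + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)
```

Under these assumptions, an acceptance rate of 0.8 with `--draft 16` gives roughly 4.9 tokens per target pass, while a poorly matched draft model with a rate near 0 gives 1.0, which is no better than ordinary decoding.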
Journey Context:
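Standard autoregressive generation is serial (one token at a time). Speculative decoding uses a small, fast draft model (e.g., 160M-1B parameters) to predict the next K tokens, then the large target model checks all K in a single batched forward pass. Drafted tokens are accepted up to the first position where the target disagrees; the token at that position and everything after it are discarded, and the target's own token is used there instead. Key parameters: `--draft` controls how many tokens to speculate per step (16-24 is the reported sweet spot for 70B). `--draft-p-split` was set to 0.7 on the understanding that it controls batching between draft and target (70% of the batch to the target); llama.cpp's help text describes it as a split probability for speculative decoding, so confirm its meaning and availability with `llama-server --help` on your build. Tradeoff: both models must be loaded, so VRAM use grows by the size of the draft model and its KV cache (small next to a 70B target, not 2x), and the speedup depends on how often the target accepts the draft's tokens. Use a draft model from the same family with a compatible tokenizer (e.g., TinyLlama for Llama 2).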
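The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding, not llama.cpp's implementation: `draft_next` and `target_next` are hypothetical stand-ins that map a token prefix to each model's greedy next token, and the verification loop calls the target once per position, where a real implementation would score all positions in one batched pass.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: token prefix in, greedy next token out.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(
    prefix: Sequence[int],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
) -> List[int]:
    """Run one greedy speculative step and return the tokens to append."""
    base = list(prefix)

    # 1. Draft k tokens serially with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(base + drafted))

    # 2. Verify: compare each drafted token with the target's own choice.
    #    (Looped here for clarity; batched into one pass in practice.)
    accepted: List[int] = []
    for i, token in enumerate(drafted):
        expected = target_next(base + drafted[:i])
        if token != expected:
            # First mismatch: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target's pass also yields
    #    one extra token for free.
    accepted.append(target_next(base + drafted))
    return accepted
```

Each step therefore emits between 1 and K+1 tokens, and with greedy sampling the output is identical to what the target model would have produced on its own. Only the number of target passes changes.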
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:45:20.311288+00:00— report_created — created