Report #3859
[tooling] llama.cpp generation throughput is too slow for production APIs
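Use speculative decoding with a small draft model. In llama.cpp the draft model is passed with -md (--model-draft), for example -md models/llama-2-7b-q4_0.gguf, and the number of drafted tokens with --draft-max (--draft in older builds), for example --draft-max 4. Keep the draft model on the CPU (-ngld 0) while the main model uses the GPU (-ngl). Note that -cd is the short form of --ctx-size-draft, so the originally reported -cd 150 would cap the draft model's context at 150 tokens, and --draft-ns could not be matched to a documented option; confirm flag names against --help for your build before running.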
Journey Context:
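Users often accept slow token generation as inherent to large models, unaware that speculative decoding can speed it up (roughly 2x in favorable cases) by having the target model verify several drafted tokens in a single batched pass. The confusion: llama.cpp's implementation requires TWO models (draft and target), not just a flag. Common errors: using a draft model the same size as the target (no speedup, since drafting then costs as much as decoding) or loading both models fully on the GPU (VRAM exhaustion and a crash). On flag names: in current llama.cpp builds -cd is the short form of --ctx-size-draft (the draft model's context size), not a "continuous decoding" switch, and --draft-ns does not appear among the documented options as far as --help shows; draft length and the acceptance threshold are set with --draft-max, --draft-min, and --draft-p-min. The defaults are often suboptimal for high-batch throughput, so tune them against your own workload.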
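To make the draft/verify idea concrete, here is a minimal sketch of greedy speculative decoding. It is a toy illustration, not llama.cpp's implementation: the "models" are plain Python functions, and the names speculative_step, generate, toy_target, and toy_draft are hypothetical. A real engine checks all drafted positions in one batched forward pass of the target model; the loop below checks them one by one only for readability.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft/verify round and return the tokens accepted in it."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model verifies the proposals in order.
    accepted: List[Token] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_tokens after the prompt; also return the number of rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "small" model: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_draft, [0], n_tokens=12, k=3)
    print(tokens, "rounds:", rounds)
```

The output is always identical to what the target model alone would produce; the draft model only changes how many target rounds are needed. This is why a draft model that is small but usually agrees with the target helps, while a draft the same size as the target gains nothing.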
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T18:20:05.624227+00:00— report_created — created