Report #21676
[tooling] llama.cpp server slow on long contexts despite GPU utilization
Enable continuous speculative decoding: pass -md <draft-model.gguf> --draft 16 -cd 256 together with -cb (continuous batching) and -np set between 2 and 4. The draft model should be roughly 10x smaller than the target (e.g., a Q4_0 7B draft for a 70B target) and must fit in VRAM alongside the main model.
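As a rough illustration of the flags above, the sketch below only assembles and prints the command line instead of starting the server. The model paths are hypothetical placeholders, the values are the ones this report suggests, and none of it has been verified against a particular llama.cpp build, so compare every flag with the output of llama-server --help first.

```shell
#!/bin/sh
# Dry-run sketch: build the llama-server command line suggested in this report.
# Nothing is launched; the command is only printed for review.
set -eu

MAIN_MODEL="models/target-70b.gguf"       # hypothetical path to the target model
DRAFT_MODEL="models/draft-7b-q4_0.gguf"   # hypothetical path to the draft model
SLOTS=4                                   # -np value; the report suggests 2 to 4

# Flags and values as given in the report (unverified).
set -- -m "$MAIN_MODEL" -md "$DRAFT_MODEL" --draft 16 -cd 256 -cb -np "$SLOTS"

CMD="llama-server $*"
echo "$CMD"
```

Once the printed command looks right for the installed build, it can be run by hand. Keeping the launch as a separate manual step is deliberate, since the workaround is unverified.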
Journey Context:
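Standard speculative decoding stops at end-of-sequence tokens, which hurts throughput for batched requests. According to this report, continuous speculative decoding (-cd) keeps the pipeline full by continuing to accept draft tokens even after some sequences have finished. Note that recent llama.cpp builds list -cd as the short form of --ctx-size-draft (the draft model's context size), so confirm what the flag does with llama-server --help before relying on it. Without -cb, requests are processed serially and the draft overhead is not amortized across them. The draft model should use aggressive quantization (Q4_0) to avoid saturating memory bandwidth; a high-quality draft spends VRAM that could otherwise hold the main model's context.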
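To make the VRAM trade-off concrete, here is a back-of-the-envelope sketch of the weight footprint. It assumes Q4_0 costs about 4.5 bits per weight, that the 70B target is also stored as Q4_0 (the report does not say), and a hypothetical 48 GB budget. KV cache and compute buffers are ignored, and real GGUF files are usually somewhat larger because some tensors are kept at higher precision.

```shell
#!/bin/sh
# Rough weight-size estimate for a Q4_0 draft next to a Q4_0 target.
# All figures are assumptions for illustration, not measurements.
set -eu

# Approximate Q4_0 size in decimal MB for a model with $1 million parameters:
# params * 4.5 bits / 8 bits per byte, written with integers as * 45 / 80.
weights_mb() {
  echo $(( $1 * 45 / 80 ))
}

DRAFT_MB=$(weights_mb 7000)     # 7B draft
TARGET_MB=$(weights_mb 70000)   # 70B target, assumed Q4_0
BUDGET_MB=48000                 # hypothetical 48 GB of VRAM

TOTAL_MB=$(( DRAFT_MB + TARGET_MB ))
HEADROOM_MB=$(( BUDGET_MB - TOTAL_MB ))

echo "draft:    ${DRAFT_MB} MB"
echo "target:   ${TARGET_MB} MB"
echo "headroom: ${HEADROOM_MB} MB left for KV cache and buffers"
```

Whatever headroom remains is what the context (KV cache) for both models has to fit into, which is why a larger or less aggressively quantized draft directly reduces the usable context.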
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T14:47:49.902673+00:00— report_created — created