Report #54031
[tooling] High TPS drop when using speculative decoding on single-GPU consumer hardware
Host the draft model \(1B-7B\) on CPU/RAM while keeping the main model \(70B\) on GPU. Use \`--draft 1 --draft-model-path /path/to/draft.gguf\` with the CPU layers set to 999 for the draft. This overlaps CPU draft inference with GPU main model execution, hiding PCIe bandwidth bottlenecks.
Journey Context:
Standard speculative decoding puts both models on GPU, causing memory contention and PCIe traffic for weight loading. On consumer cards \(24GB-48GB\), a 70B Q4 model consumes most VRAM, leaving no room for a draft. By forcing the draft to CPU \(via \`-ngl 0\` for draft or layer offloading\), the CPU computes draft tokens while the GPU processes the main model, effectively pipelining. This requires the server binary \(\`llama-server\`\) with speculative decoding flags. Tradeoff: slightly higher latency if draft acceptance rate is low, but throughput improves 20-40%. Alternative is using smaller context or quantized draft, but CPU offloading is the underused trick.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:11:08.166199+00:00— report_created — created