Report #43940
[tooling] Slow inference on large GGUF models despite high GPU utilization
Use llama.cpp's speculative decoding: load a small draft model with `-md ` and set draft context size with `-cd 256` to achieve 1.5-2x speedup via parallel token verification.
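For a rough sense of where a speedup figure like this can come from, here is a minimal back-of-the-envelope sketch. It is not taken from llama.cpp. It assumes each drafted token is accepted independently with a fixed probability, and the names `accept_rate`, `draft_len`, and `cost_ratio` are hypothetical parameters chosen for illustration.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per main-model verification pass.

    Assumption: each drafted token is accepted independently with
    probability accept_rate, and the main model always contributes one
    token itself (a correction, or a bonus token after a full match).
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    cost_ratio is the time of one draft-model step divided by the time
    of one main-model step (hypothetical; measure it on your hardware).
    """
    # One speculative step costs draft_len draft passes plus one main pass.
    step_cost = draft_len * cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


if __name__ == "__main__":
    for accept_rate in (0.6, 0.8, 0.9):
        speedup = estimated_speedup(accept_rate, draft_len=5, cost_ratio=0.1)
        print(f"accept_rate={accept_rate}: ~{speedup:.2f}x")
```

Under these assumptions a low acceptance rate or an expensive draft model pushes the estimate toward or below 1x, so real gains depend on how well the draft model tracks the main model.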
Journey Context:
Users often assume slow inference is due to quantization or batch size, but single-stream autoregressive decoding is memory-bandwidth bound: every generated token requires reading the full set of model weights. Speculative decoding uses a smaller 'draft' model to predict several tokens ahead, then the main model verifies them in a single parallel pass. The key is balancing draft model speed against accuracy: too large a draft model adds overhead, while too small a one lowers the acceptance rate. The `-cd` flag sets the draft model's context size, and 128-256 is the reported sweet spot; it does not set how many tokens are drafted per step, which recent llama.cpp builds control separately with `--draft-max`. Many users try to use the same model for drafting (self-speculative), but this requires specific architecture support; using a separate smaller GGUF (e.g., Q4_0 7B for a 70B main model) is more reliable.
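The draft-then-verify loop described above can be sketched with toy models. This is an illustration of the greedy variant only, not llama.cpp's implementation. `toy_target` and `toy_draft` are hypothetical stand-ins for the main and draft models, and the verification loop imitates what a real engine does in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], draft_len: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append."""
    # 1. The cheap draft model proposes draft_len tokens autoregressively.
    proposed: List[int] = []
    for _ in range(draft_len):
        proposed.append(draft(prefix + proposed))

    # 2. The main model checks each proposed position in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            # First mismatch: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the main model adds one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, draft_len: int) -> Tuple[List[int], int]:
    """Generate n_tokens; return the sequence and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, draft_len))
        steps += 1
    return out[:len(prompt) + n_tokens], steps


def toy_target(prefix: List[int]) -> int:
    # Hypothetical main model: an arbitrary deterministic rule.
    return (prefix[-1] * 3 + 1) % 11


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft model: agrees with the target except when the
    # prefix length is a multiple of 4.
    guess = toy_target(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11


if __name__ == "__main__":
    tokens, steps = generate(toy_target, toy_draft, [1], n_tokens=12, draft_len=4)
    print(tokens, "verification steps:", steps)
```

With greedy verification the output matches what the main model would produce on its own; the gain comes from needing fewer main-model passes than generated tokens. Sampling-based decoding uses a probabilistic acceptance rule instead of the exact-match check shown here.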
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:13:30.949792+00:00— report_created — created