Report #8012
[tooling] llama.cpp speculative decoding OOM when loading draft model on GPU
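Keep the draft model entirely on CPU by setting its GPU layer count to 0 with `-ngld 0` (`--gpu-layers-draft`), and keep the main model on GPU with `-ngl`. Point to the draft GGUF with `-md` (`--model-draft`). In recent llama.cpp builds `--draft` sets the number of tokens to draft rather than the model path, and a plain `-ngl 0` moves the main model to CPU instead of the draft. Flag names have changed between releases, so confirm them with `--help`. The draft's per-token latency is expected to be small compared to main model forward passes.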
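A minimal sketch of the invocation described above, written as a Python helper that only builds the argument list. It assumes the `llama-speculative` binary and the `-m`, `-md`, `-ngl`, and `-ngld` flags found in recent llama.cpp builds. The model file names and the layer count are hypothetical placeholders.

```python
import shlex


def build_speculative_cmd(main_model, draft_model, main_gpu_layers,
                          binary="llama-speculative"):
    """Return an argv list that keeps the draft model on CPU.

    The binary name and flag spellings are assumptions based on recent
    llama.cpp builds; check `--help` on your own build.
    """
    return [
        binary,
        "-m", main_model,              # main (target) model GGUF
        "-ngl", str(main_gpu_layers),  # GPU layers for the main model
        "-md", draft_model,            # draft model GGUF
        "-ngld", "0",                  # 0 GPU layers: draft stays on CPU
    ]


if __name__ == "__main__":
    # Hypothetical file names and layer count, for illustration only.
    cmd = build_speculative_cmd("main-70b.gguf", "draft-7b.gguf", 35)
    print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` once the flags have been checked against the installed build. Keeping the draft layer count hard-coded to 0 makes the intended split explicit.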
Journey Context:
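Users instinctively try to offload both models to the GPU, e.g. `-ngl 35` for the main model and the same layer count for the draft, which causes OOM on consumer cards (e.g., 24 GB). The draft model (e.g., 7B) is small enough that CPU inference reportedly adds <5 ms per token, though that figure is not benchmarked here and depends on the CPU and quantization, while the main model (70B) gains far more speed from every layer it can keep in VRAM. Alternatives such as quantizing the draft model heavily (Q2_K) free some memory but degrade the acceptance rate, which erodes the speculative speedup. The correct split is: main model at the maximum number of GPU layers that fit, draft model at 0 GPU layers.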
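The memory argument can be sketched as a rough budget check. This is an illustrative estimate only: it assumes all main-model layers are the same size, and it lumps KV cache and compute buffers into one reserve value. Every size below is a hypothetical placeholder, not a measurement from this report.

```python
GIB = 1024 ** 3


def estimate_vram(main_bytes, main_layers, main_gpu_layers,
                  draft_bytes, draft_on_gpu, reserve_bytes):
    """Rough VRAM estimate in bytes.

    Assumes uniform layer sizes for the main model and a flat reserve
    for KV cache and compute buffers (both simplifications).
    """
    main_part = main_bytes * main_gpu_layers / main_layers
    draft_part = draft_bytes if draft_on_gpu else 0
    return main_part + draft_part + reserve_bytes


if __name__ == "__main__":
    card = 24 * GIB  # consumer card from the report
    for draft_on_gpu in (True, False):
        # Hypothetical sizes: 40 GiB main model with 80 layers, 35 of them
        # on GPU, a 4 GiB draft model, and a 3 GiB reserve.
        need = estimate_vram(40 * GIB, 80, 35, 4 * GIB, draft_on_gpu, 3 * GIB)
        print(f"draft on GPU={draft_on_gpu}: "
              f"{need / GIB:.1f} GiB, fits={need <= card}")
```

With these placeholder numbers the draft model is what pushes the total past the card's capacity, which is the situation the report describes. Real usage should be checked against the memory figures llama.cpp logs at load time.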
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:19:31.607907+00:00— report_created — created