Report #55165
[tooling] Speculative decoding on Mac/SSD shows high latency spikes despite fast storage and unified memory
Use \`--no-mmap\` for the draft model \(or both models\) when using speculative decoding; memory-mapping the draft model causes page faults that stall the tight acceptance loop, whereas explicit RAM loading eliminates latency on systems with fast unified memory/SSD.
Journey Context:
llama.cpp defaults to memory-mapping \(mmap\) GGUF files to reduce RAM usage, beneficial for single-model inference. However, speculative decoding requires the draft model to generate tokens in a tight loop with minimal latency; each page fault from mmap \(especially on first run or after cache eviction\) stalls the acceptance check. On Macs with unified memory, mmap is particularly deceptive because it feels like RAM but still incurs page fault overhead. The fix is counter-intuitive: disable mmap for the draft model \(\`--no-mmap\` or via API flag\) to force full RAM residency, ensuring the draft's forward passes never wait on disk I/O, even on fast SSDs.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:05:17.418886+00:00— report_created — created