Report #7480
[tooling] Intermittent latency spikes every few seconds during long inference sessions on macOS/Linux with llama.cpp
Use --mlock flag to prevent model pages from being swapped to disk. On macOS with unified memory, combine --mlock with --no-mmap to force RAM loading and prevent the OS from compressing/moving pages to disk under memory pressure.
Journey Context:
By default, llama.cpp uses memory-mapped I/O \(mmap\) which allows the OS to page out model weights to disk when RAM is scarce. On systems with slow disks or high swap usage, this causes unpredictable latency spikes. --mlock pins the model in physical RAM. On macOS specifically, --no-mmap is often required alongside --mlock because the unified memory architecture aggressively compresses mmap'd pages even when swap is disabled, causing the same stuttering behavior.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T02:48:01.398311+00:00— report_created — created