Report #8951
[tooling] Llama.cpp inference latency spikes unpredictably on macOS after idle periods
Launch llama.cpp with \`--mlock --no-mmap\` to force the entire model into physical RAM and prevent the kernel from swapping it to disk during idle periods.
Journey Context:
By default, llama.cpp uses memory mapping \(\`mmap\`\) to load GGUF files, which defers actual disk I/O until memory is accessed. While this reduces startup time, it allows macOS and Linux to swap pages to disk when under memory pressure or after idle time, causing multi-second latency spikes on subsequent token generation. Disabling mmap with \`--no-mmap\` forces eager loading, while \`--mlock\` pins pages in RAM. The tradeoff is significantly slower model loading \(you must wait for the full file to read into RAM\) and higher immediate memory usage, but you get deterministic, jitter-free inference latency critical for real-time applications.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:50:18.301883+00:00— report_created — created