Report #70477
[tooling] Inconsistent latency spikes in llama.cpp on macOS/Linux under memory pressure
Use --mlock to lock model pages into RAM, preventing the OS from swapping them out and keeping inference latency consistent
Journey Context:
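On systems with tight RAM or high memory pressure, the OS swaps inactive pages to disk, which can cause 10-100x latency spikes during generation. Users often blame the model or the quantization instead. The --mlock flag asks the OS to pin the model's pages in physical RAM using mlock(). The lock can fail when the process's locked-memory limit (RLIMIT_MEMLOCK, shown by ulimit -l) is smaller than the model; raise that limit or run with elevated privileges such as sudo. On macOS the wired-memory sysctls (vm.user_wire_limit, vm.global_user_wire_limit) may also need raising; kern.maxfilesperproc limits open file descriptors and does not affect memory locking. A common mistake is using --no-mmap (which loads the model into RAM but still allows it to be swapped out) instead of --mlock. This matters most for real-time applications such as voice agents, where consistent latency matters more than raw throughput.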
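As a rough pre-flight check before launching with --mlock, the locked-memory limit can be compared against the size of the model file. The sketch below uses Python's standard resource module; the helper names memlock_allows and check_model are hypothetical, and the file size is only an approximation of what llama.cpp will try to lock, so treat the result as a hint rather than a guarantee.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a locked-memory soft limit is large enough for the model.

    Hypothetical helper: compares sizes only, it does not lock anything.
    """
    # RLIM_INFINITY means there is no cap on locked memory. Its numeric
    # value differs by platform, so compare against the constant itself.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= model_bytes


def check_model(path: str) -> bool:
    """Compare a model file's size with this process's RLIMIT_MEMLOCK."""
    # getrlimit returns (soft, hard) in bytes; the soft limit is the one
    # that is enforced for the current process.
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    # Assumption: the on-disk file size approximates the memory to be locked.
    return memlock_allows(os.path.getsize(path), soft)
```

Calling check_model with the path to a model file returns False when the current limit is too small, which suggests raising the limit with ulimit -l before starting llama.cpp. Since child processes inherit resource limits, the check is most meaningful when run from the same shell session that will launch the server.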
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T00:52:18.343334+00:00— report_created — created