Report #16147
[tooling] llama.cpp server has random latency spikes or throughput drops during long generation sessions on Linux/Mac despite having free RAM
Add the --mlock flag to your llama-server or llama-cli invocation (on Linux, also make sure the process is allowed to lock enough memory: run ulimit -l unlimited in the launching shell, or grant the process CAP_IPC_LOCK). This pins the model weights in physical RAM so the OS cannot evict them under memory pressure from other processes.
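As a rough pre-flight check before launching with --mlock, the locked-memory limit can be compared against the size of the model file. The sketch below uses Python's standard resource module; the function names memlock_allows and check_model are hypothetical helpers, not part of llama.cpp. It assumes the check runs in the same shell (and so with the same limits) that will start the server, and it cannot see CAP_IPC_LOCK, which bypasses the limit on Linux.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a soft RLIMIT_MEMLOCK value permits locking model_bytes.

    Hypothetical helper: treats the model file size as the amount to lock,
    which is only an approximation of what the process will actually pin.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= model_bytes


def check_model(path: str) -> bool:
    """Compare this process's soft memlock limit with the model file size."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(path), soft)
```

If check_model returns False for your model path, raising the limit (or granting CAP_IPC_LOCK) before starting the server is likely needed for --mlock to take effect.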
Journey Context:
By default, llama.cpp loads models with mmap(), so the weights are file-backed pages that the kernel is free to evict. On a busy system, or when switching between several large models, the kernel may drop those pages even when RAM appears 'available' (much of that figure is cache/buffers); the next access then has to re-read them from the model file on disk. This shows up as unpredictable I/O wait during generation. --mlock calls mlock() on the loaded weights so they stay resident; it does not by itself disable mmap (that is the separate --no-mmap flag). The trade-off is a slower load, because every page is faulted in up front, and the need for a sufficient locked-memory limit, in exchange for more predictable latency. On macOS the flag is the same, --mlock. A common mistake is forgetting to raise ulimit -l: mlock then fails, and llama.cpp typically logs a warning and carries on without locking, which is easy to miss.
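Because a failed lock is easy to overlook, it can help to confirm after startup that memory really is locked. On Linux, the VmLck field in /proc/<pid>/status reports the locked size in kB. The following is a minimal sketch under that assumption; locked_kib and locked_kib_for_pid are hypothetical helper names, and this approach does not apply to macOS, which has no /proc.

```python
import re
from typing import Optional


def locked_kib(status_text: str) -> Optional[int]:
    """Extract the VmLck value (in kB) from /proc/<pid>/status contents."""
    match = re.search(r"^VmLck:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def locked_kib_for_pid(pid: int) -> Optional[int]:
    """Read /proc/<pid>/status (Linux only) and return the locked size in kB."""
    with open(f"/proc/{pid}/status") as f:
        return locked_kib(f.read())
```

A value near zero for a server started with --mlock suggests the lock did not succeed; a value roughly matching the model size suggests it did.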
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T01:54:29.228057+00:00— report_created — created