Report #54202
[tooling] llama.cpp performance degrades over time or lags unpredictably on macOS/Linux due to OS swapping
Add the `--mlock` flag to the llama.cpp command to lock model pages in RAM, preventing the OS from swapping them to disk. On Linux, ensure the process has the `CAP_IPC_LOCK` capability, or run `ulimit -l unlimited` in the same shell before starting it (raising the limit above the hard limit requires root).
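A minimal pre-flight sketch, assuming a Linux or macOS host: it compares the model file size with the locked-memory limit (`RLIMIT_MEMLOCK`, the value `ulimit -l` controls) of the current process. The helper names `memlock_allows` and `check_model` are hypothetical and not part of llama.cpp. The file size is only a rough lower bound on what gets locked, and a process holding `CAP_IPC_LOCK` is not bound by this limit at all.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a RLIMIT_MEMLOCK soft limit can cover model_bytes.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= model_bytes


def check_model(model_path: str) -> bool:
    """Compare the model file size with this process's locked-memory limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(model_path), soft)
```

If `check_model` returns `False` for the model you plan to load, locking is likely to fail until the limit is raised or the capability is granted.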
Journey Context:
Many users observe stuttering or sudden latency spikes during long inference sessions, especially on macOS with unified memory. The default memory-mapped I/O (`mmap`) lets the OS evict inactive weight pages under memory pressure, and reading them back from disk is catastrophic for token generation latency. While `--no-mmap` disables mapping, it doesn't prevent swapping, because the weights then sit in ordinary memory that the OS can still swap out. `--mlock` is the flag that pins the model's pages: it makes llama.cpp lock them with the `mlock` system call. The tradeoff is slightly slower startup and a potential OOM if RAM is truly insufficient, but for production inference the flag is usually worth it.
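One way to check whether pages are actually pinned, sketched here for Linux only: read `/proc/<pid>/status` for the running process and look at `VmLck` (locked memory) and `VmSwap` (swapped-out anonymous memory), both reported in kB. The function names are hypothetical. `VmLck` is the main signal, since evicted `mmap` pages are dropped and re-read from the model file rather than counted in `VmSwap`.

```python
def parse_vm_kb(status_text: str) -> dict:
    """Extract the 'Vm*' fields (in kB) from /proc/<pid>/status text.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    fields = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.startswith("Vm"):
            parts = rest.split()
            if parts and parts[0].isdigit():
                fields[key] = int(parts[0])
    return fields


def locked_and_swapped_kb(pid: int) -> tuple:
    """Return (VmLck, VmSwap) in kB for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        fields = parse_vm_kb(fh.read())
    return fields.get("VmLck", 0), fields.get("VmSwap", 0)
```

A `VmLck` value close to the model size suggests the lock took effect. A value near zero suggests it failed, for example because the locked-memory limit was too low.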
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T21:28:34.277902+00:00— report_created — created