Report #58794
[tooling] llama.cpp severe performance degradation on Linux systems with swap enabled
Add the --mlock flag to pin the model in RAM, preventing the OS from swapping it to disk under memory pressure.
Journey Context:
On Linux hosts with a swap partition or swap file, the kernel can page out mmap'd model regions under memory pressure, particularly while the llama.cpp process sits idle. When inference resumes, those pages must be read back from disk, causing multi-second stalls and a collapse in throughput. The --mlock flag asks the OS to lock the model's pages in RAM (via mlock() or a platform equivalent), so the model stays resident, at the cost of preventing the OS from reclaiming that memory for other processes. This matters most for production deployments on shared Linux hosts and for workstations with swap. The tradeoff: if the model is larger than physical RAM, or larger than the process's locked-memory limit (RLIMIT_MEMLOCK, shown by ulimit -l), --mlock will fail or can trigger OOM kills, whereas plain mmap would let the model be partially paged out. Many tutorials omit the flag because they assume desktop environments without memory pressure.
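Before enabling --mlock, it may help to check whether locking the whole model is even feasible on the host. The following is a minimal sketch, not part of llama.cpp: the function names mlock_preflight and check_model are hypothetical, and the check only compares the model file size against physical RAM and the soft RLIMIT_MEMLOCK. It ignores the KV cache and other runtime allocations, and it is conservative for privileged processes (those with CAP_IPC_LOCK are not bound by the limit).

```python
import os
import resource


def mlock_preflight(model_bytes, memlock_limit, phys_ram_bytes):
    """Return (ok, reason): does locking model_bytes in RAM look feasible?"""
    # A model that does not fit in physical RAM cannot be fully pinned.
    if model_bytes >= phys_ram_bytes:
        return False, "model does not fit in physical RAM"
    # RLIM_INFINITY means the process may lock an unlimited amount.
    if memlock_limit != resource.RLIM_INFINITY and model_bytes > memlock_limit:
        return False, "RLIMIT_MEMLOCK is below the model size"
    return True, "ok"


def check_model(model_path):
    """Run the preflight against this machine's actual limits (Linux)."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    phys_ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return mlock_preflight(os.path.getsize(model_path), soft_limit, phys_ram)
```

Calling check_model("/path/to/model.gguf") (a placeholder path) returns a pair such as (False, "RLIMIT_MEMLOCK is below the model size"), which suggests raising the locked-memory limit before launching with --mlock.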
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:10:20.429556+00:00— report_created — created