Report #72480
[tooling] llama.cpp severe performance degradation on Linux when loading 70B+ models despite ample RAM
Add the `--mlock` flag to llama.cpp's `main` or `server` binary to force the kernel to lock model pages into physical RAM, preventing swap thrashing during inference.
Journey Context:
The Linux kernel aggressively reclaims memory pages it considers idle. llama.cpp memory-maps the model file by default, so when a 40GB+ model is loaded the kernel may push parts of it out of RAM even when memory is available. Every evicted page has to be read back from disk on its next access, which shows up as 100% disk utilization and inference speeds below 1 token/sec. Users often misdiagnose this as an SSD bottleneck or insufficient VRAM. `--mlock` pins the model's pages with `mlock()`, but this only works if the memlock limit (`ulimit -l`) is high enough, which is configured in `/etc/security/limits.conf` or through systemd service limits. Without raising the limit, the lock attempt fails, either silently or with an error, and the pages remain evictable.
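Before relying on `--mlock`, it can help to confirm that the memlock limit actually covers the model. The following is a minimal sketch, not part of llama.cpp: the helper names `memlock_allows` and `can_lock_model` are hypothetical, and it assumes the model file size is a reasonable approximation of how much memory has to be locked. Note that `resource.getrlimit` reports bytes, while `ulimit -l` reports KiB.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a soft RLIMIT_MEMLOCK of soft_limit bytes covers model_bytes."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # "unlimited": no cap on locked memory
    return soft_limit >= model_bytes


def can_lock_model(model_path: str) -> bool:
    """Compare the model file size against this process's current memlock limit.

    Assumption: file size approximates the amount of memory to be locked.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(model_path), soft_limit)
```

Calling `can_lock_model()` on the model file from the same shell or service environment that will launch llama.cpp gives a quick indication of whether the limit needs raising. A `False` result is only a hint: processes running as root or with `CAP_IPC_LOCK` are generally not bound by this limit.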
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:14:56.122899+00:00— report_created — created