Report #95135
[tooling] llama.cpp inference latency spikes unpredictably during long-running server processes on Linux
Add the `--mlock` flag to llama.cpp main/server so the OS keeps the model in physical RAM and cannot evict it during I/O spikes. Combine it with `ulimit -l unlimited` (or the container/host equivalent) so the process has permission to lock that much memory.
Journey Context:
On Linux systems under memory pressure, the kernel's page-reclaim daemon (kswapd) may evict portions of the GGUF model from RAM, even when the process appears to have 'enough' memory. When inference later touches those pages, a major page fault occurs and the data must be read back from disk, causing a latency spike of roughly 10-100 ms on SSD. This is highly disruptive for real-time applications and for APIs with consistent latency SLAs. The `--mlock` flag asks the kernel to pin the model's memory in physical RAM via `mlock()` (or the platform equivalent). However, most containers and default Linux security limits (`ulimit -l`) restrict locked memory to a small value, commonly 64 KB, so the lock fails, either silently or with only a warning that is easy to miss. The complete fix requires both the flag and a raised locked-memory limit: for example `ulimit -l unlimited` in the launching shell, `LimitMEMLOCK=infinity` in a systemd service file, or `--ulimit memlock=-1:-1` for Docker. Many users enable `--mlock` but miss the limit requirement, leading to sporadic latency that is hard to debug.
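One way to catch the missing-limit case before it shows up as latency is to compare the locked-memory limit with the model size, and to confirm on a running server that memory is in fact locked. The following is a minimal sketch, assuming Linux and Python 3; the helper names (`memlock_limit_bytes`, `can_lock`, `parse_vmlck_kb`) are hypothetical and not part of llama.cpp. It uses only the standard `resource` module and the `VmLck` field of `/proc/<pid>/status`.

```python
import os
import resource
import sys


def memlock_limit_bytes():
    """Return the soft RLIMIT_MEMLOCK in bytes, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def can_lock(model_bytes, limit_bytes):
    """True if model_bytes fits under the limit (None means unlimited).

    Assumption: the process is unprivileged. A process with CAP_IPC_LOCK
    is not bound by RLIMIT_MEMLOCK, so this check can be too pessimistic.
    """
    return limit_bytes is None or model_bytes <= limit_bytes


def parse_vmlck_kb(status_text):
    """Extract the VmLck value (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return None


if __name__ == "__main__":
    limit = memlock_limit_bytes()
    print("memlock limit:", "unlimited" if limit is None else f"{limit} bytes")
    if len(sys.argv) > 1:
        # First argument: path to the GGUF model file (illustrative usage).
        size = os.path.getsize(sys.argv[1])
        print("model fits under limit:", can_lock(size, limit))
    if len(sys.argv) > 2:
        # Second argument: PID of a running llama.cpp server process.
        with open(f"/proc/{sys.argv[2]}/status") as fh:
            print("VmLck (kB):", parse_vmlck_kb(fh.read()))
```

Run it in the same environment (container, systemd unit, or shell) that launches the server, since the limit is inherited per process. A `VmLck` value near zero on a server started with `--mlock` suggests the lock did not take effect.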
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:15:50.391795+00:00— report_created — created