Report #8767
[tooling] Inference latency spikes randomly after several hours of RAG serving with llama.cpp server
Add `--mlock` to the llama.cpp server arguments to lock model weights into RAM, preventing the OS from paging them out to disk under memory pressure.
Journey Context:
Linux can page memory out even when RAM appears available, because the page cache competes with process memory. When serving LLMs with long contexts (RAG), the OS may push model weights out of RAM to make room for the KV cache or document embeddings. Memory-mapped weights are dropped and must be re-read from the model file; weights loaded without mmap go to swap. Either way, the next access to those pages causes multi-second latency spikes. `--mlock` asks the OS to pin the model in RAM (via mlock() on Linux), which keeps latency consistent. The tradeoff is that the system cannot use that RAM for other processes, so the host needs sufficient physical RAM (model size + context + overhead). This is critical for production RAG servers where p99 latency matters.
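Before enabling the flag, it may help to check that the lock can succeed at all. On Linux, memory locking by an unprivileged process is capped by RLIMIT_MEMLOCK (`ulimit -l`), and the host needs room for the weights plus everything else. The sketch below is a rough pre-flight check, not part of llama.cpp: the function names and the `extra_bytes` headroom figure are assumptions for illustration, and the headroom for KV cache and embeddings has to be estimated for your own workload.

```python
import os
import resource

GiB = 1024 ** 3


def memlock_limit_bytes():
    """Return the soft RLIMIT_MEMLOCK in bytes, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def physical_ram_bytes():
    """Total physical RAM as reported by sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def can_mlock(model_bytes, extra_bytes, limit_bytes, ram_bytes):
    """Rough check that locking the weights is plausible.

    model_bytes: size of the model file (the memory that gets locked)
    extra_bytes: assumed headroom for KV cache, embeddings and the OS
    limit_bytes: RLIMIT_MEMLOCK in bytes, or None if unlimited
    ram_bytes:   total physical RAM
    """
    if limit_bytes is not None and model_bytes > limit_bytes:
        return False, "RLIMIT_MEMLOCK too low; raise it (e.g. ulimit -l)"
    if model_bytes + extra_bytes > ram_bytes:
        return False, "not enough physical RAM for weights plus headroom"
    return True, "ok"


# Example use (the model path is hypothetical):
# model_bytes = os.path.getsize("/models/model.gguf")
# print(can_mlock(model_bytes, 4 * GiB,
#                 memlock_limit_bytes(), physical_ram_bytes()))
```

If the check reports a low RLIMIT_MEMLOCK, the limit has to be raised for the user or service running the server before `--mlock` can take effect. A process with privileges to bypass the limit would not be constrained by it, which this sketch does not model.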
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T06:20:23.103828+00:00— report_created — created