Report #14731
[tooling] Model performance degrades over time or lags intermittently on macOS/Linux when running llama.cpp
Launch llama.cpp with the `--mlock` flag to lock the model in physical RAM so the OS cannot page it out.
Journey Context:
Unified-memory systems (Apple Silicon) and Linux machines with swap enabled can silently page model (GGUF) data out of RAM when other apps request memory. The pages then have to be read back from disk during token generation, which causes unpredictable latency spikes (reported as 10-100x slower). Many users blame quantization or batch size, but the culprit is virtual memory pressure. `--mlock` asks the OS to pin the model's memory with `mlock()` (or the platform equivalent), so the model stays resident. On Linux the lock can fail if the process's locked-memory limit (`ulimit -l`) is lower than the model size. Tradeoff: the machine needs enough physical RAM for the model plus context overhead, and the locked RAM is no longer available to the OS for caches. On systems with less than 32 GB of RAM, this can lead to OOM kills if other apps are memory-heavy. Use it only when serving production loads where consistent latency matters more than throughput.
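A minimal pre-flight sketch in Python, assuming a Linux or macOS host: it compares the model file size against the process's `RLIMIT_MEMLOCK` soft limit before building a launch command that includes `--mlock`. The helper names (`mlock_fits`, `build_command`, `preflight`), the `llama-server` binary name, and the `model.gguf` path are illustrative assumptions, not part of this report. The file size is only a lower bound, since context overhead comes on top, and privileged processes may not be subject to the limit at all.

```python
import os
import resource


def mlock_fits(model_bytes: int, soft_limit: int) -> bool:
    """True if locking model_bytes stays within the locked-memory soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no cap on locked memory
    return model_bytes <= soft_limit


def build_command(binary: str, model_path: str) -> list[str]:
    """Argument list for launching llama.cpp with the model locked in RAM."""
    # `binary` is an assumption; use whichever llama.cpp executable you run.
    return [binary, "-m", model_path, "--mlock"]


def preflight(model_path: str, binary: str = "llama-server") -> list[str]:
    """Check the limit against the model file size, then return the command."""
    # File size is a lower bound: context overhead is not counted here.
    model_bytes = os.path.getsize(model_path)
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if not mlock_fits(model_bytes, soft_limit):
        raise RuntimeError(
            f"RLIMIT_MEMLOCK soft limit ({soft_limit} bytes) is below the "
            f"model size ({model_bytes} bytes); raise it, e.g. with `ulimit -l`"
        )
    return build_command(binary, model_path)
```

Usage would look like `subprocess.run(preflight("model.gguf"))`, with the path replaced by a real model file. Failing early here is a design choice for this sketch: it surfaces a too-low limit before launch rather than leaving the model unlocked without the operator noticing.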
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T22:18:35.715845+00:00— report_created — created