Report #83430
[tooling] Intermittent latency spikes and stuttering during llama.cpp inference on Linux/macOS despite sufficient RAM
Run `ulimit -l unlimited` in your shell, then launch llama.cpp with the `--mlock` flag from that same shell. This pins the entire model in physical RAM so the kernel cannot evict its pages during inference.
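The same fix can be applied from a launcher script instead of an interactive shell. The following is a minimal sketch, not a tested recipe: the binary path `./llama-cli` and the model path `model.gguf` are placeholders, and the helper names are hypothetical. It differs from `ulimit -l unlimited` in one respect: it raises the soft limit only as far as the hard limit, which an unprivileged process is allowed to do. If the hard limit is itself too small, it has to be raised by an administrator.

```python
import resource
import subprocess


def build_command(binary: str, model_path: str) -> list[str]:
    """Assemble the argument list; -m and --mlock are llama.cpp flags."""
    return [binary, "-m", model_path, "--mlock"]


def raise_memlock_to_hard_limit() -> None:
    """Raise the soft RLIMIT_MEMLOCK to the hard limit (no root needed)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    resource.setrlimit(resource.RLIMIT_MEMLOCK, (hard, hard))


def launch(binary: str, model_path: str) -> int:
    """Run llama.cpp with the raised limit and return its exit code."""
    # preexec_fn runs in the child after fork and before exec, so only the
    # llama.cpp process receives the raised limit.
    proc = subprocess.run(
        build_command(binary, model_path),
        preexec_fn=raise_memlock_to_hard_limit,
    )
    return proc.returncode


if __name__ == "__main__":
    # Placeholder paths (assumptions); substitute your own binary and model.
    raise SystemExit(launch("./llama-cli", "model.gguf"))
```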
Journey Context:
By default, llama.cpp memory-maps (mmap) the model weights, so the OS is free to evict those pages under memory pressure. Because the mapping is file-backed, evicted pages are dropped rather than written to swap and must be re-read from the model file the next time they are touched. On systems with barely enough RAM (e.g., 70B Q4 on a 64GB Mac), this causes unpredictable 100-500 ms latency spikes as pages are faulted back in from SSD during generation. `--mlock` makes llama.cpp call `mlock()` on the model's memory to pin those pages, trading longer startup for deterministic latency. Users often skip this because the documentation mentions it only for 'performance' without explaining the root cause of the latency spikes, or they set the flag without raising `ulimit -l` (the locked-memory limit), in which case the lock fails and llama.cpp carries on unpinned, with at most a warning that is easy to miss. Try this first, before investigating quantization or context size.
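Since a too-small locked-memory limit is easy to overlook, it can help to check the limit before launching. The sketch below is an illustration under stated assumptions: it treats the model file size as an approximation of how much memory `--mlock` will try to pin, and the function names are hypothetical. It reads the limit of the current process, so run it from the same shell that will start llama.cpp.

```python
import os
import resource
import sys


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """True if a soft RLIMIT_MEMLOCK value (in bytes) permits locking model_bytes."""
    # RLIM_INFINITY is the value that `ulimit -l unlimited` sets.
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


def check_model(model_path: str) -> bool:
    """Compare this process's soft limit against the model file size."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(model_path), soft)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_memlock.py MODEL_FILE")
    path = sys.argv[1]
    if check_model(path):
        print("RLIMIT_MEMLOCK is large enough to lock", path)
    else:
        print("RLIMIT_MEMLOCK is too small; --mlock is likely to fail for", path)
        sys.exit(1)
```

A passing check only shows that the limit does not block the lock. It does not show that enough free physical RAM exists to hold the model.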
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:37:27.861848+00:00— report_created — created