Report #75933
[tooling] llama.cpp inference stutters or slows dramatically after initial tokens on Linux/macOS despite sufficient RAM
Run with the `--mlock` flag to lock the model's pages in RAM so the OS cannot page them out. On Linux, first run `ulimit -l unlimited` in the same shell so the process is allowed to lock that much memory.
Journey Context:
Without mlock, the OS treats the memory-mapped model as ordinary file-backed memory (mmap is enabled by default). Under even slight memory pressure it evicts weight pages, which then have to be re-read from disk the next time the model touches them, causing severe latency spikes. The symptom is often misdiagnosed as a quantization or batch-size problem. mlock forces the whole model to stay resident, so it needs enough physical RAM to hold the model and may require raising the locked-memory limit with ulimit. On macOS, this reportedly requires running as root or adjusting sysctl kern.maxvnodes.
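Before launching with mlock, it can help to check whether the current locked-memory limit is large enough for the model file. The following is a minimal sketch, not part of llama.cpp: the helper names `fits_memlock_limit` and `describe` are hypothetical, and it assumes the model's on-disk size roughly matches what mlock would need to pin. It only reads the soft `RLIMIT_MEMLOCK` value that `ulimit -l` reports; privileged processes may be able to exceed that limit.

```python
import os
import resource


def fits_memlock_limit(model_bytes: int, soft_limit: int) -> bool:
    """Return True if model_bytes fits within the given locked-memory soft limit."""
    # RLIM_INFINITY is what `ulimit -l unlimited` sets.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def describe(model_path: str) -> str:
    """Compare a model file's size with this process's RLIMIT_MEMLOCK soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    size = os.path.getsize(model_path)  # assumption: file size approximates locked size
    if fits_memlock_limit(size, soft):
        return f"ok: {size} bytes fits within the locked-memory limit"
    return (
        f"limit too low: {soft} bytes allowed, model is {size} bytes; "
        "raise it with `ulimit -l unlimited` before starting"
    )
```

Calling `describe("model.gguf")` (the path is a placeholder) from the same shell that will start llama.cpp shows whether the limit needs raising, since child processes inherit it from that shell.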
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:02:46.566152+00:00— report_created — created