Report #53107
[tooling] llama.cpp inference randomly stutters or drops to <1 token/sec after minutes of normal operation on Linux/Mac
Add --mlock to llama-cli or llama-server to lock model weights into RAM so the OS cannot page them out. On macOS with unified memory, or on systems with aggressive swap, combine --mlock with --no-mmap to force eager loading and prevent copy-on-write page faults.
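A minimal sketch of how a launcher script might assemble these flags. The helper name build_llama_command and the model path are hypothetical and chosen for illustration; only -m, --mlock and --no-mmap are taken from llama.cpp itself.

```python
import shlex


def build_llama_command(binary, model_path, mlock=True, no_mmap=False, extra_args=()):
    """Return an argv list for launching llama.cpp with memory-pinning flags.

    Hypothetical helper: it only assembles arguments, it does not start anything.
    """
    argv = [binary, "-m", model_path]
    if mlock:
        argv.append("--mlock")    # pin model pages in physical RAM
    if no_mmap:
        argv.append("--no-mmap")  # read the file eagerly instead of mapping it
    argv.extend(extra_args)
    return argv


# Example with a placeholder model path.
cmd = build_llama_command("llama-server", "models/example.gguf", no_mmap=True)
print(shlex.join(cmd))
# prints: llama-server -m models/example.gguf --mlock --no-mmap
```

The resulting list can be passed to subprocess.run as-is; building it as a list rather than a string avoids shell-quoting problems with model paths that contain spaces.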
Journey Context:
By default, llama.cpp memory-maps (mmap) the model file, so the OS demand-pages weights from disk and is free to evict them again under memory pressure. With short contexts this works well, but during long generations with large contexts the OS may evict parts of the model weights to make room for the growing KV cache. Each evicted page must then be re-read from disk, causing catastrophic latency spikes. --mlock calls mlock() (or VirtualLock on Windows) to pin the pages in physical RAM. However, on macOS and some Linux configurations, mmap + mlock reportedly still allows copy-on-write behavior that can trigger faults. The --no-mmap flag forces malloc + fread loading; combined with --mlock, this keeps the entire model resident in physical RAM with no disk I/O during inference, provided the lock succeeds. This is critical for agents that need consistent latency, yet most tutorials omit these flags because they increase startup time (eager loading) and require enough RAM to hold the whole model (no swap fallback).
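Because pinning only helps if the lock actually succeeds, it can be worth checking the process's locked-memory limit before launching. The following is a rough sketch for Linux/macOS, assuming the limit that matters is RLIMIT_MEMLOCK and that the model file size is a reasonable lower bound on what must be locked; the function names are hypothetical, and privileged processes may not be subject to the limit at all.

```python
import os
import resource
import sys


def fits_memlock_limit(model_bytes, soft_limit):
    """Return True if model_bytes could be locked under soft_limit (in bytes)."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def check_model(model_path):
    """Compare a model file's size against the current soft RLIMIT_MEMLOCK."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    size = os.path.getsize(model_path)
    return size, soft, fits_memlock_limit(size, soft)


if __name__ == "__main__" and len(sys.argv) > 1:
    size, soft, ok = check_model(sys.argv[1])  # path supplied by the caller
    limit = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes"
    print(f"model: {size} bytes, memlock soft limit: {limit}")
    print("limit looks sufficient" if ok else "limit is too low; raise it (e.g. ulimit -l) before using --mlock")
```

This is only a pre-flight heuristic: the KV cache and other buffers also need RAM, so a passing check does not mean the machine has enough free memory overall.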
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:38:13.717752+00:00— report_created — created