Report #71424
[tooling] llama.cpp inference performance degrades 100x after minutes of runtime on macOS/Linux with swap enabled
Use the `--mlock` flag to pin model weights in RAM. On Linux, first run `ulimit -l unlimited` in the launching shell, or set `LimitMEMLOCK=infinity` in the systemd service, so the process is allowed to lock that much memory; this keeps the kernel from paging out the weights and thrashing.
Journey Context:
llama.cpp loads model weights with memory-mapped I/O (mmap) by default, which lets the OS page weights out of RAM under memory pressure. For 70B+ models on systems with less than 128 GB of RAM (e.g., MacBooks with 36-64 GB), the kernel tends to evict model pages it considers inactive, and every token then waits on disk reads: inference drops from 10-20 tok/s to below 0.1 tok/s (thrashing). The `--mlock` flag asks the OS, via `mlock()`, to pin the model pages in physical RAM so they cannot be paged out. Tradeoff: physical RAM must be at least the model size (no overcommit is possible), and the process needs permission to lock that much memory. Critical pitfall: using `--mlock` without raising the memlock limit (the default is often only 64 KB), which results in 'Cannot allocate memory' or a partial lock. Solution: run `ulimit -l unlimited` in the launching shell, or set `LimitMEMLOCK=infinity` in the systemd unit. On macOS, locking may require running with sudo or specific entitlements. For 70B models on high-end laptops, this flag is the difference between usable performance and complete failure.
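The memlock pitfall above can be checked before launching. The following is a minimal sketch, not part of llama.cpp: it compares a model file's size with the current process's `RLIMIT_MEMLOCK` soft limit using Python's standard `resource` module. The helper names `memlock_allows` and `check_model` are hypothetical, and the sketch assumes the file size approximates the amount of memory that `--mlock` would try to pin. It does not account for privileged processes, which on Linux may bypass the limit.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """True if a soft RLIMIT_MEMLOCK of `soft_limit` bytes covers `model_bytes`."""
    # RLIM_INFINITY is the sentinel for "unlimited"; its numeric value
    # differs between platforms, so compare against it explicitly.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= model_bytes


def check_model(model_path: str) -> bool:
    """Compare a model file's size with this process's memlock soft limit.

    Assumption: the file size approximates what --mlock would need to pin.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    size = os.path.getsize(model_path)
    ok = memlock_allows(size, soft)
    limit = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes"
    print(f"model: {size} bytes, memlock soft limit: {limit}, sufficient: {ok}")
    return ok
```

Calling `check_model` with the path to a model file, from the same shell or service environment that will launch llama.cpp, shows whether the limit inherited by that environment is large enough. A `False` result suggests raising the limit first, as described above.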
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:27:39.655273+00:00— report_created — created