Report #86695
[tooling] llama.cpp sporadic latency spikes or stuttering on MacBook Pro despite plenty of free RAM
Run `ulimit -l unlimited` in the same shell before starting llama.cpp with the `--mlock` flag, which locks model pages into RAM so macOS cannot compress or swap them
Journey Context:
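macOS on Apple Silicon aggressively compresses inactive memory and can swap to SSD even when RAM appears available. When llama.cpp weights (e.g., 40GB for 70B Q4) are compressed or paged out by the OS, inference latency spikes to 100-500ms per token. The `--mlock` flag calls `mlock()` to pin pages in physical RAM, but the call fails when the model is larger than the shell's locked-memory limit (`ulimit -l`), which this report gives as 32MB by default on macOS; run `ulimit -l` to see the value on your machine. Run `ulimit -l unlimited` in the same shell before starting the server, because the limit is per process and is inherited only by programs launched from that shell. Raising it above the hard limit requires root privileges; the report also suggests changing `/etc/sysctl.conf`, but that route is unverified. The reporter contrasts this with Linux, where `--mlock` reportedly often works without changes, although Linux also enforces a `ulimit -l` value that may need raising.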
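The check described above can be sketched as a small shell helper. This is a minimal sketch, not part of llama.cpp: the function name `memlock_ok`, its arguments, and the model path in the usage comment are hypothetical, and it assumes `ulimit -l` reports its value in KiB (as bash and zsh do) or the word `unlimited`.

```shell
# Hypothetical helper (not part of llama.cpp): succeed if the locked-memory
# limit can cover a model of NEED_MIB mebibytes.
# Usage: memlock_ok NEED_MIB [LIMIT]
#   LIMIT defaults to the current shell's soft limit from `ulimit -l`,
#   assumed to be in KiB or the literal string "unlimited".
memlock_ok() {
    mlk_need_kib=$(( $1 * 1024 ))
    mlk_limit=${2:-$(ulimit -l)}
    if [ "$mlk_limit" = "unlimited" ]; then
        return 0
    fi
    [ "$mlk_limit" -ge "$mlk_need_kib" ]
}

# Example use before launching the server (model path is a placeholder):
#   ulimit -l unlimited
#   if memlock_ok 40960; then
#       ./llama-server -m /path/to/model.gguf --mlock
#   else
#       echo "locked-memory limit too low for --mlock: $(ulimit -l)" >&2
#   fi
```

Passing the limit as an optional second argument keeps the helper easy to try with sample values, for instance a 40GB model against the 32MB figure quoted in this report (`memlock_ok 40960 32768` fails).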
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:06:24.989997+00:00— report_created — created