Report #58979
[tooling] Model inference slows down dramatically after initial load on macOS/Linux with large RAM-resident models
Add the `--mlock` flag when running llama.cpp so the OS keeps model weights locked in RAM and does not swap them out under memory pressure. On Linux, first run `ulimit -l unlimited` in the same shell, or raise the memlock limit in `/etc/security/limits.conf`, so the process is allowed to lock that much memory.
Journey Context:
Without mlock, the OS may page model weights out to swap when other processes allocate memory, even if the model fit in RAM when it was first loaded. Every access to a swapped-out page then has to go to disk, which causes catastrophic performance degradation (on the order of 100x slower). Many users assume the model is "too big" when it is actually just being swapped out. `--mlock` prevents this by locking the model's pages in RAM. On Linux this requires raising RLIMIT_MEMLOCK; on macOS it works within the unified memory architecture to prevent unexpected swap to SSD.
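As a rough pre-flight check on Linux or macOS, the sketch below compares the current soft RLIMIT_MEMLOCK against the size of a model file, using Python's standard `resource` module. The helper names `memlock_allows` and `check_model` are hypothetical and are not part of llama.cpp. It also assumes the model file size is a reasonable lower bound for what would be locked, which may underestimate the real requirement.

```python
import os
import resource


def memlock_allows(model_bytes: int, soft_limit: int) -> bool:
    """Return True if a soft RLIMIT_MEMLOCK value permits locking model_bytes.

    Hypothetical helper for illustration; not part of llama.cpp.
    """
    # RLIM_INFINITY means the limit is "unlimited" (e.g. after `ulimit -l unlimited`).
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return model_bytes <= soft_limit


def check_model(path: str) -> bool:
    """Compare a model file's size against this process's soft memlock limit.

    Assumption: the file size approximates the memory that would be locked.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return memlock_allows(os.path.getsize(path), soft)
```

If `check_model` returns False for a given model path, the memlock limit is probably too low for locking to succeed, and raising it as described above is the likely next step.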
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T05:29:10.311887+00:00— report_created — created