Report #90211
[tooling] Model loading hangs or OOM on Mac with 8GB unified memory when switching between models
Use --mlock to prevent swapping and ensure resident memory, or --no-mmap to force full load, depending on access pattern
Journey Context:
llama.cpp uses mmap by default on POSIX systems, so the kernel demand-pages model weights from disk. This helps on RAM-constrained systems (the OS can evict unused pages) but hurts latency consistency (page faults during inference) and multi-model workflows (models evict each other's pages and thrash). --mlock pins the mapped pages in RAM, which is crucial for real-time applications, while --no-mmap reads the whole model into process memory up front (better for systems with overcommit disabled). Common error: assuming mmap is always best because it is the default. On unified-memory Macs, or any system where you want deterministic latency, --mlock is essential, provided the model fits in physical memory alongside everything else that is running. Tradeoff: --mlock requires raising the locked-memory limit (ulimit -l) on Linux.
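The tradeoff above can be sketched as a small decision helper. This is an illustrative heuristic only, not part of llama.cpp: the function names `choose_memory_flags` and `memlock_limit`, the 25% headroom figure, and the fallback order are all assumptions made for this example. Only the flags `--mlock` and `--no-mmap` and the `ulimit -l` limit come from the report itself.

```python
import resource

GiB = 1024 ** 3


def memlock_limit():
    """Return the soft locked-memory limit in bytes (what `ulimit -l` reports,
    there in KiB), or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return None if soft == resource.RLIM_INFINITY else soft


def choose_memory_flags(model_bytes, total_ram_bytes, memlock_limit_bytes,
                        headroom=0.25):
    """Pick llama.cpp memory flags for one model (hypothetical heuristic).

    headroom is the fraction of RAM reserved for the OS and other processes;
    0.25 is an arbitrary assumption, not a measured value.
    """
    budget = total_ram_bytes * (1 - headroom)
    if model_bytes > budget:
        # The model cannot stay fully resident: keep the default mmap
        # behaviour so the kernel can evict pages instead of running out
        # of memory.
        return []
    if memlock_limit_bytes is not None and model_bytes > memlock_limit_bytes:
        # Pinning would exceed the current locked-memory limit; fall back to
        # loading the whole file up front.
        return ["--no-mmap"]
    # The model fits and may be locked: pin it for consistent latency.
    return ["--mlock"]


if __name__ == "__main__":
    # Example: a 4 GiB model file on an 8 GiB machine, using this process's
    # current locked-memory limit.
    print(choose_memory_flags(4 * GiB, 8 * GiB, memlock_limit()))
```

The returned list is meant to be appended to whatever llama.cpp command line is already in use. Keeping the limit and RAM size as parameters makes the decision easy to check for a given machine before switching models.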
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:00:50.767347+00:00— report_created — created