Report #50962
[tooling] Slow load times or system freezes when loading 70B+ GGUF files in llama.cpp
Leave mmap enabled (it is the default; do not pass --no-mmap) so the OS virtual memory manager can load the file lazily without an extra copy. Avoid --mlock unless you see paging thrash after load: mlock pins the entire model in physical RAM and can trigger the OOM killer on large models.
Journey Context:
Without mmap (--no-mmap), llama.cpp performs a blocking read of the entire GGUF file into heap-allocated RAM, which causes long load times and an immediate spike in memory pressure. With mmap (the default), the OS maps the file into the process's virtual address space and uses demand paging: a page is read from disk only when it is first touched, so startup can feel near-instant and the read cost moves to first use. The trade-off is that under memory pressure the kernel may evict these pages. Because they are file-backed, they are dropped and later re-read from the model file rather than written to swap, but the effect is the same: latency spikes during generation. --mlock asks the OS to pin the model's pages in physical RAM (mlock() on POSIX systems) so they cannot be evicted. This defeats lazy loading and requires roughly the full model size in free RAM up front (e.g., 40GB+ for 70B Q4), which on systems with borderline RAM often ends with the OOM killer terminating the process. The recommended workflow is: start with the default mmap for fast startup, watch for paging thrash during inference, and restart with --mlock only if latency is unacceptable and there is enough RAM to hold the whole model. On Windows, mmap is typically less efficient than on Linux, but it is still preferable to --no-mmap.
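The difference between the two loading strategies can be illustrated with a small Python sketch. This is not llama.cpp code: read_whole_file, map_file, and mlock_is_plausible are hypothetical helper names, and the 10% headroom is an assumed rule of thumb rather than anything llama.cpp enforces.

```python
import mmap
import os
import tempfile


def read_whole_file(path):
    """Eager load: copy every byte into a heap buffer (analogous to --no-mmap)."""
    with open(path, "rb") as f:
        return f.read()


def map_file(path):
    """Lazy load: map the file read-only; pages are faulted in on first access."""
    with open(path, "rb") as f:
        # The mapping keeps its own handle, so closing f here is safe.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def mlock_is_plausible(model_bytes, available_bytes, headroom_percent=10):
    """Assumed rule of thumb: pinning needs the full model size plus some headroom."""
    needed = model_bytes + model_bytes * headroom_percent // 100
    return needed <= available_bytes


if __name__ == "__main__":
    # Stand-in for a model file: a 4-byte magic followed by 1 MiB of zeros.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"GGUF" + bytes(1 << 20))
        path = tmp.name
    try:
        mapped = map_file(path)
        # Only the pages backing the slice below need to be resident.
        print(mapped[:4], len(mapped))
        mapped.close()
    finally:
        os.remove(path)
```

Before restarting with --mlock, a check such as mlock_is_plausible(model_size, available_ram) with real figures from the system gives a rough indication of whether pinning is likely to succeed. The threshold is illustrative only.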
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:01:35.510580+00:00— report_created — created