Report #61078
[tooling] Model loading takes too long or uses too much RAM when switching between multiple local LLMs
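Keep memory mapping enabled (the default in recent llama.cpp versions, so no flag is normally needed). GGUF files are then mapped into virtual memory instead of being copied into RAM up front, which makes switching between models near-instant when the files are still in the OS page cache and reduces memory pressure. Pass --no-mmap when the model sits on slow storage (HDD) or when maximum, consistent inference speed is required: the whole model is read into RAM at load time, which avoids page-fault latency during generation.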
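As a rough illustration of the rule above, the following Python sketch builds an argument list for a llama.cpp binary. The function name `llama_args`, its two boolean parameters, and the choice of the `llama-cli` binary are assumptions made for this example. Only `-m` and `--no-mmap` are taken as real llama.cpp options.

```python
def llama_args(model_path, slow_storage=False, benchmark=False):
    """Return an argv list for llama.cpp (hypothetical helper, not part of llama.cpp)."""
    # Binary name is an assumption; substitute llama-server or another tool as needed.
    args = ["llama-cli", "-m", model_path]
    # Memory mapping is the default, so only the opt-out flag is ever added.
    if slow_storage or benchmark:
        args.append("--no-mmap")
    return args
```

For example, `llama_args("model.gguf", slow_storage=True)` appends `--no-mmap`, while the default call leaves memory mapping on.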
Journey Context:
By default, llama.cpp memory-maps the GGUF file (mmap on POSIX systems) directly into the process's virtual address space instead of allocating a buffer with malloc() and copying the file into it with read(). Loading therefore appears near-instant: no weights are copied up front, pages are faulted in from disk the first time they are touched, and the OS can drop those pages under memory pressure and re-read them later. The tradeoff is that inference may stall whenever the OS has to page weights in from SSD or HDD during generation. Because mmap is already the default, users on fast NVMe storage can almost always leave it on; users on HDDs, or anyone running performance-critical benchmarks, should pass --no-mmap. Many tutorials do not explain this toggle or when to use it.
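The difference between the two loading strategies can be sketched with Python's standard mmap module. This is a minimal illustration of the mechanism, not llama.cpp's actual loader; the function names and the demo file are hypothetical.

```python
import mmap
import os
import tempfile

def read_slice_full(path, offset, length):
    """Copy the whole file into memory, then slice it (analogous to --no-mmap)."""
    with open(path, "rb") as f:
        data = f.read()  # the entire file is read before any byte is used
    return data[offset:offset + length]

def read_slice_mmap(path, offset, length):
    """Map the file and touch only the requested range (analogous to mmap loading)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Only the pages backing this range need to be paged in from disk.
            return mm[offset:offset + length]

def make_demo_file(size=1 << 20):
    """Write a small stand-in for a model file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 251 for i in range(size)))
    return path
```

Both functions return the same bytes. The mapped version defers disk I/O until a page is first touched, which is why load time is short but a later access can stall on slow storage.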
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:00:32.155745+00:00— report_created — created