Report #78582
[tooling] Poor token generation speed and disk thrashing when loading 70B+ GGUF models on high-RAM systems
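Disable memory mapping with --no-mmap to read the entire model into RAM at load time. On Linux/macOS, --mlock can be used as well: it does not disable memory mapping, but it pins the model's pages in RAM so the OS cannot evict them. Either way, page faults during inference are avoided, trading a slower initial model load for consistent generation speed.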
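As a minimal sketch of how the flags above might be assembled for a launcher script: the helper name `build_llama_command`, the default binary name `llama-cli`, and the model path are assumptions for illustration; only `-m`, `--no-mmap`, and `--mlock` come from llama.cpp itself.

```python
import shlex


def build_llama_command(model_path, no_mmap=True, mlock=False, binary="llama-cli"):
    """Return an argv list for llama.cpp with eager-loading flags.

    `binary` and `model_path` are placeholders; adjust them to your install.
    """
    argv = [binary, "-m", model_path]
    if no_mmap:
        # Read the whole file into RAM at load time instead of mapping it.
        argv.append("--no-mmap")
    if mlock:
        # Ask the OS to keep the model's pages resident (Linux/macOS).
        argv.append("--mlock")
    return argv


# Hypothetical model path, shown only to illustrate the resulting command line.
cmd = build_llama_command("models/example-70b.gguf", no_mmap=True)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` unchanged. Note that --mlock may require a raised locked-memory limit on some systems, so it is left off by default here.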
Journey Context:
By default, llama.cpp uses mmap() to memory-map the GGUF file, so the OS pages model weights in from disk on demand. This makes the load appear fast but causes unpredictable latency during generation whenever inference touches pages that are not yet resident in RAM, or that were evicted under memory pressure, which shows up as disk thrashing. For 70B+ models on systems with sufficient RAM (64GB+), --no-mmap forces eager loading into memory. The --mlock flag additionally prevents the OS from swapping these pages to disk. The cost is a startup time roughly 30-60 seconds longer, but generation speed becomes consistent and limited by memory bandwidth rather than disk I/O. This is especially useful for Mac Studio/Pro users running 70B models.
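The difference between on-demand paging and eager loading can be illustrated with a small standalone sketch. It uses a 1 MiB temporary file of random bytes as a stand-in for a GGUF file (not a real model) and is not how llama.cpp itself is implemented; it only shows the two access patterns.

```python
import mmap
import os
import tempfile

# Stand-in for a model file: 1 MiB of dummy bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1 << 20))
    path = tmp.name

# Lazy: mapping the file is cheap; data is paged in on first access,
# so the cost of disk I/O is paid later, at the point of use.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    lazy_slice = mapped[4096:4096 + 16]  # first touch may trigger a page fault
    mapped.close()

# Eager: read() copies the whole file into process memory up front,
# so later accesses never wait on the disk.
with open(path, "rb") as f:
    eager = f.read()
eager_slice = eager[4096:4096 + 16]

os.remove(path)
print(lazy_slice == eager_slice)
```

Both paths return the same bytes; what differs is when the disk read happens. With a file tens of gigabytes in size, that timing difference is what separates a fast load with stalls during generation from a slow load with steady generation.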
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:29:56.128299+00:00— report_created — created