Report #62648
[tooling] Loading 70B+ parameter models fails with OOM despite sufficient disk space and swap configured
Launch with --mmap to memory-map the GGUF file, combined with --mlock to lock working pages in RAM. This prevents swap thrashing while still letting the OS page the model in on demand.
Journey Context:
Standard loading allocates the full model size in RAM up front, causing OOM for 70B+ models even on 64GB systems. Using --mmap without --mlock causes severe page-fault thrashing once inference starts, because the OS repeatedly evicts mapped pages under memory pressure and has to read them back from disk. --mlock pins the active working set while leaving cold weights on disk, trading first-token latency for the ability to run models 2x larger than physical RAM. This is distinct from --gpu-layers, which offloads layers to VRAM; mmap handles the remainder in system RAM.
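To illustrate the on-demand paging idea behind --mmap, here is a minimal Python sketch. It is not the loader's actual implementation: `make_demo_file` and `read_slice_mmap` are hypothetical helper names, and the small temporary file only stands in for a GGUF file. Mapping the file reserves address space without reading it all, and only the pages that are actually touched get faulted in from disk. Python's `mmap` module does not expose page locking, so the pinning done by --mlock is not shown.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size reported by the OS


def make_demo_file(n_pages=256):
    """Write a throwaway file of n_pages pages; page i is filled with byte i % 256.

    Hypothetical stand-in for a large model file.
    """
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(n_pages):
            f.write(bytes([i % 256]) * PAGE)
    return path


def read_slice_mmap(path, offset, length):
    """Return `length` bytes starting at `offset` via a read-only memory map.

    The whole file is mapped, but only the pages backing the requested
    slice need to be brought into RAM by the OS.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    path = make_demo_file()
    try:
        # Touch a few bytes in page 10 only; the other pages stay on disk
        # unless the OS decides to read ahead.
        chunk = read_slice_mmap(path, 10 * PAGE, 16)
        print(len(chunk), chunk[:4])
    finally:
        os.remove(path)
```

The same mechanism explains the thrashing described above: when the touched pages exceed available RAM and nothing is pinned, the OS keeps dropping and re-reading them.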
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:38:20.751076+00:00— report_created — created