Report #8013
[tooling] GGUF model runs out of CPU RAM during loading despite having enough VRAM
Enable memory mapping with `--mmap` (or `use_mmap=True` in llama-cpp-python) combined with partial GPU offload. This avoids double-buffering: the GGUF file is mapped directly into virtual memory as file-backed pages that the OS can evict under memory pressure, rather than being copied into a separate RAM buffer, while the GPU layers are allocated separately in VRAM.
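To make the term "memory mapping" concrete, here is a minimal sketch using Python's standard `mmap` module. It only illustrates the OS mechanism and is not a description of llama.cpp's internals; the helper names, the throwaway file, and the `TAIL` marker are all hypothetical.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Map a file read-only and return one slice of it.

    Only the pages backing the requested slice need to be faulted in;
    the rest of the file can stay on disk until it is touched.
    """
    with open(path, "rb") as f:
        # length=0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def make_demo_file(size_bytes: int = 1 << 20) -> str:
    """Write a throwaway file standing in for a weights file (hypothetical)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        # Zero padding followed by a 4-byte marker at the very end.
        f.write(b"\x00" * (size_bytes - 4) + b"TAIL")
    return path


if __name__ == "__main__":
    demo_path = make_demo_file()
    try:
        size = os.path.getsize(demo_path)
        print(read_slice_via_mmap(demo_path, size - 4, 4))
    finally:
        os.remove(demo_path)
```

The slice at the end of the file is read without an explicit `read()` of everything before it, which is the property that matters when a model file is larger than available RAM.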
Journey Context:
When using partial GPU offload (e.g., `-ngl 20` on a 33-layer model) without memory mapping, llama.cpp reads the full GGUF weights into CPU RAM first, then copies the offloaded layers to the GPU. The process RSS then reaches the full model size (e.g., 40GB) even if only 10GB ends up on the GPU, which leads to OOM on a system with 32GB RAM + 24GB VRAM. The fix is enabling memory mapping (`--mmap` or `use_mmap=True`), which lets the OS page the weights in from disk on demand. Mapped pages are file-backed, so the kernel can drop them under memory pressure, while the GPU layers are still allocated in VRAM. In builds where memory mapping is already on by default, the equivalent check is that `--no-mmap` (or `use_mmap=False`) is not set. `--mlock` can be added to prevent the active layers from being swapped out, but it pins those pages in RAM, so use it only when the CPU-resident portion fits in physical memory. This is crucial for running 70B models on Macs with unified memory or PCs with limited DRAM.
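The settings described above could be expressed in llama-cpp-python roughly as follows. This is a sketch, not a verified fix: `build_llama_kwargs`, `load_model`, and the model path are hypothetical names for illustration, while `model_path`, `n_gpu_layers`, `use_mmap`, and `use_mlock` are parameters of the `llama_cpp.Llama` constructor. Check the defaults for your installed version.

```python
from typing import Any, Dict


def build_llama_kwargs(
    model_path: str,
    n_gpu_layers: int,
    pin_in_ram: bool = False,
) -> Dict[str, Any]:
    """Collect constructor arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        # Number of layers to offload to the GPU, like `-ngl` on the CLI.
        "n_gpu_layers": n_gpu_layers,
        # Map the GGUF file instead of reading it into a RAM buffer.
        "use_mmap": True,
        # Pinning pages prevents swapping but makes them count as resident,
        # so it stays off unless the CPU-side portion fits in physical RAM.
        "use_mlock": pin_in_ram,
    }


def load_model(model_path: str, n_gpu_layers: int, pin_in_ram: bool = False):
    """Load a model with memory mapping and partial GPU offload."""
    # Imported lazily so the helper above works without llama-cpp-python installed.
    from llama_cpp import Llama

    return Llama(**build_llama_kwargs(model_path, n_gpu_layers, pin_in_ram))


if __name__ == "__main__":
    # Placeholder path; 20 layers mirrors the `-ngl 20` example in the report.
    print(build_llama_kwargs("models/example.gguf", 20))
```

Keeping the arguments in a plain dictionary makes it easy to log exactly which memory settings were used when comparing RSS between runs.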
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T04:19:31.747742+00:00— report_created — created