Report #96727
[tooling] Loading 70B Q4 GGUF on Mac Studio with 64GB RAM fails with 'bus error' or malloc failure despite 40GB free
Pre-shard the GGUF into 4GB chunks using `llama-gguf-split --split-max-size 4096M`, then load it with `llama-server` pointing at the first shard (e.g., `model-00001-of-00010.gguf`); llama.cpp locates the remaining shards from that file. The reported cause is that macOS mmap struggles with one very large single-file mapping (address space fragmentation or kernel limits); splitting sidesteps this because each shard is mapped separately.
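As a rough illustration of the two steps, the sketch below only builds and prints the command lines; it does not run them. The file names `model.gguf` and `model` are placeholders, and it assumes llama.cpp's usual convention of loading a split model by passing the first shard to `-m`. `--split` is spelled out here although it is the tool's default operation.

```python
import shlex


def split_command(src, out_prefix, max_size="4096M"):
    """Build (but do not run) a llama-gguf-split invocation.

    src and out_prefix are placeholder paths chosen for this example.
    """
    return [
        "llama-gguf-split",
        "--split",
        "--split-max-size", max_size,
        src,
        out_prefix,
    ]


def serve_command(first_shard):
    """Build (but do not run) a llama-server invocation for a split model."""
    # Assumption: the first shard is passed to -m and the rest are found from it.
    return ["llama-server", "-m", first_shard]


if __name__ == "__main__":
    # Print the commands for review before running anything by hand.
    print(shlex.join(split_command("model.gguf", "model")))
    print(shlex.join(serve_command("model-00001-of-00010.gguf")))
```

Printing the commands first fits the unverified status of this workaround: check them against your installed llama.cpp version before running anything.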
Journey Context:
Apple Silicon Macs use unified memory, so the CPU and GPU share one pool. When loading a 40GB model, llama.cpp uses `mmap()` by default. On macOS, mapping a single file this large reportedly fails with 'Cannot allocate memory', attributed to address space fragmentation or kernel limits, even when plenty of RAM is free. The `llama-gguf-split` tool shards the tensors across multiple `.gguf` files (e.g., `model-00001-of-00010.gguf`). At load time, llama.cpp treats the shards as a single logical model but maps each file separately, which avoids the single large mmap. The report considers this essential for 70B+ models on Mac. Without splitting, the alternative is to disable mmap with `--no-mmap`, which reads the weights into allocated RAM instead of mapping them from the file; according to this report, that roughly doubles memory usage (weights + working memory) and often causes OOM. On that basis, splitting is presented as the only efficient way to run 70B Q4 on a 64GB Mac.
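The shard arithmetic can be sketched as follows. This is an estimate only: it assumes a 40 GiB file and a 4096 MiB limit, and the real shard count may differ because `llama-gguf-split` cuts on tensor boundaries. The helper `estimate_shard_names` is hypothetical, written for this example, and follows the `-00001-of-00010` naming pattern mentioned above.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_shard_names(prefix, total_bytes, max_shard_bytes):
    """Estimate shard file names for a model of total_bytes.

    Hypothetical helper: real output depends on tensor sizes, not just bytes.
    """
    # Ceiling division: a partial final shard still counts as one file.
    count = -(-total_bytes // max_shard_bytes)
    return [f"{prefix}-{i:05d}-of-{count:05d}.gguf" for i in range(1, count + 1)]


# Assumed sizes: a 40 GiB model split at 4096 MiB per shard.
names = estimate_shard_names("model", 40 * GIB, 4096 * MIB)
print(len(names))   # 10
print(names[0])     # model-00001-of-00010.gguf
```

Under these assumptions each mapping is about 4 GiB rather than one 40 GiB mapping, which is the property the workaround relies on.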
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:56:37.119794+00:00— report_created — created