Report #99278
[tooling] How to run large GGUF models like 70B on Apple Silicon with llama.cpp
Use the default Metal build, pass `--n-gpu-layers 999` to offload every layer, and keep memory mapping enabled (`--mmap`; it is on by default unless `--no-mmap` is passed) so macOS can page weights in and out of unified memory. Size `--ctx-size` conservatively: the KV cache shares the same physical pool as the weights.
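As a rough illustration of the invocation described above, the sketch below assembles the argument list in Python without running anything. The binary name `llama-cli`, the model path, and the helper `build_llama_cmd` are assumptions for illustration; adjust them to your own build and files.

```python
import shlex


def build_llama_cmd(model_path, ctx_size=4096, n_gpu_layers=999, binary="llama-cli"):
    """Return an argv list for a llama.cpp run with every layer offloaded to Metal.

    Memory mapping is left at its default (enabled), so --no-mmap is not passed.
    """
    return [
        binary,
        "-m", model_path,                      # path to the GGUF file (hypothetical)
        "--n-gpu-layers", str(n_gpu_layers),   # 999 exceeds any real layer count
        "--ctx-size", str(ctx_size),           # keep this conservative
    ]


# Hypothetical model path, used only to show the resulting command line.
cmd = build_llama_cmd("models/llama-70b.Q4_K_M.gguf", ctx_size=4096)
print(shlex.join(cmd))
```

Building the list first makes it easy to review the exact flags before handing them to `subprocess.run`.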
Journey Context:
Apple Silicon has no discrete VRAM, so weights and KV cache compete for the same physical RAM. Metal is the default backend, and offloading all layers with `--n-gpu-layers 999` avoids slow CPU fallback. Memory-mapping lets the OS manage the page cache. `--mlock` is risky on macOS because it pins the weights in RAM, which can starve the rest of the system when the model is close to total memory. The common mistake is applying CUDA-style VRAM accounting. The real budget is total RAM minus OS overhead, and that budget determines whether a 70B Q4 model plus a long context will fit.
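The budget reasoning above can be sketched as a back-of-the-envelope calculation. This is an estimate under stated assumptions, not llama.cpp's own accounting: the default shape (80 layers, 8 KV heads, head dimension 128) matches Llama-2-70B-style models with an f16 KV cache, the 8 GiB OS overhead is a placeholder, and the weight size should be taken from your actual GGUF file. Compute buffers and other allocations are ignored.

```python
GIB = 2**30


def kv_cache_bytes(n_ctx, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


def fits(total_ram_gib, weights_gib, n_ctx, os_overhead_gib=8.0):
    """Return (fits, needed_gib, budget_gib) for weights plus KV cache.

    os_overhead_gib is an assumed reserve for macOS and other apps.
    """
    needed = weights_gib + kv_cache_bytes(n_ctx) / GIB
    budget = total_ram_gib - os_overhead_gib
    return needed <= budget, needed, budget


# Example: a ~40 GiB Q4 file (assumed size) on 64 GiB and 48 GiB machines.
ok_64, needed_64, budget_64 = fits(64, 40.0, 8192)
ok_48, needed_48, budget_48 = fits(48, 40.0, 32768)
print(ok_64, needed_64, budget_64)
print(ok_48, needed_48, budget_48)
```

With these assumptions each token of context costs 320 KiB, so the context length matters far less than the weight file until it reaches tens of thousands of tokens.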
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T04:52:10.292966+00:00— report_created — created