Report #1235
[tooling] Running a 70B model on Apple Silicon is slow or runs out of memory
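Use a Q4_K_M GGUF with llama.cpp's Metal backend, pass -ngl 99 to offload all layers to the GPU, and quantize the KV cache with --cache-type-k q8_0 --cache-type-v q8_0. Quantizing the V cache requires flash attention, so enable it with -fa if your build does not turn it on automatically. A Mac with 64 GB or more can run Llama-3-70B-class models at usable context lengths because unified memory is shared between CPU and GPU, so there is no PCIe transfer bottleneck.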
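The flags above can be assembled into a single command line. The sketch below is illustrative only: the model path is a hypothetical placeholder, the 8192-token context is an arbitrary starting point, and the binary name llama-cli assumes a recent llama.cpp build (llama-server accepts the same flags). It only builds and prints the command; it does not run it.

```python
import shlex


def build_llama_cmd(model_path, ctx_size=8192, binary="llama-cli"):
    """Return an argv list for llama.cpp with full GPU offload and a q8_0 KV cache."""
    return [
        binary,
        "-m", model_path,          # path to the Q4_K_M GGUF file
        "-ngl", "99",              # offload all layers to the Metal GPU
        "-c", str(ctx_size),       # context length in tokens
        "--cache-type-k", "q8_0",  # quantize the K cache to 8 bits
        "--cache-type-v", "q8_0",  # quantize the V cache (needs flash attention)
    ]


# Hypothetical model path; substitute the GGUF you actually downloaded.
cmd = build_llama_cmd("models/llama-3-70b-instruct.Q4_K_M.gguf")
print(shlex.join(cmd))
```

Start with a modest context size and raise it only after checking memory pressure, since the KV cache grows linearly with context length.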
Journey Context:
Apple Silicon's unified memory architecture means the GPU reads weights from the same pool as the CPU, which removes the separate VRAM ceiling of a discrete GPU and the host-to-device copy overhead. llama.cpp takes advantage of this through its Metal backend, which is enabled by default in macOS builds, combined with full offloading (-ngl 99). Many users go through Ollama, which hides these flags. Running llama.cpp directly gives control over KV-cache quantization and context size, and those two settings are what actually make 70B feasible on a Mac.
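To see why KV-cache quantization and context size matter, here is a rough back-of-the-envelope estimate of KV-cache memory. It assumes the published Llama-3-70B shape (80 layers, 8 KV heads, head dimension 128) and the q8_0 layout of 34 bytes per 32 values. Other 70B-class models may differ, and the figures exclude the model weights and compute buffers.

```python
# Assumed Llama-3-70B attention shape; check your model's config before relying on it.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128

# Bytes per cached value: f16 is 2 bytes; q8_0 stores 32 int8 values
# plus one 2-byte scale per block, so 34 bytes per 32 values.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_ctx, cache_type):
    """Estimate KV-cache size in bytes for a given context length and cache type."""
    elems_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    return int(n_ctx * elems_per_token * BYTES_PER_ELEM[cache_type])


for n_ctx in (8192, 32768):
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(n_ctx, cache_type) / 2**30
        print(f"ctx={n_ctx:>6} cache={cache_type:<5} ~{gib:.2f} GiB")
```

Under these assumptions, a 32768-token context needs about 10 GiB of cache at f16 and about 5.3 GiB at q8_0, which is a meaningful difference on a 64 GB machine that also has to hold the weights.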
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T19:54:24.836104+00:00— report_created — created