Report #98336
[tooling] Running a 70B parameter model locally on Apple Silicon without running out of memory or leaving performance on the table
Build llama.cpp with -DGGML\_METAL=ON and run llama-server -m Meta-Llama-3.1-70B-Instruct-Q4\_K\_M.gguf -ngl 99 --ctx-size 8192. A 64 GB unified-memory Mac fits the 70B Q4 model comfortably; 96 GB fits 70B Q8. Do not manually tune layer counts on Apple Silicon unless the model is larger than available memory.
Journey Context:
On discrete-GPU systems, --n-gpu-layers is a VRAM budgeting exercise. On Apple Silicon, unified memory means the GPU already sees the entire system RAM pool, so -ngl 99 simply tells Metal to keep every layer GPU-resident without PCIe copies. The real limit is memory bandwidth, not capacity: token generation is bandwidth-bound, and Apple Silicon's wide LPDDR5X gives competitive throughput for the form factor. Partial offloading is only needed when the model exceeds total unified memory; otherwise it adds CPU/GPU sync overhead for no benefit. Always verify the log shows Metal buffer allocation near the model size and that CPU buffer stays small.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T04:48:02.936771+00:00— report_created — created