Report #49245
[tooling] llama.cpp on Apple Silicon shows low GPU utilization and suboptimal tokens/sec for 70B models despite unified memory
Set the environment variable `LLAMA_METAL_N_CB=4` (or `8` on M2 Max/Ultra) before launching llama.cpp so the Metal backend uses multiple command buffers. Combine it with `-ngl 999` to offload all layers, and leave `LLAMA_METAL_FULL_PRECISION=1` unset unless you specifically need it, so the default FP16 path stays in effect.
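A minimal sketch of a launcher that applies these settings, assuming the variable names behave as this report describes (they are copied from the report and not verified against the llama.cpp source). The helper name `build_launch`, the binary path `./llama-cli`, and the model path are placeholders for illustration only.

```python
import os


def build_launch(model_path, n_cb=4, binary="./llama-cli", base_env=None):
    """Return (env, argv) for a llama.cpp run using the settings in this report."""
    # Copy the environment so the caller's process environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from this report; treat it as an unverified assumption.
    env["LLAMA_METAL_N_CB"] = str(n_cb)
    # Make sure full precision is not forced, so the default FP16 path is used.
    env.pop("LLAMA_METAL_FULL_PRECISION", None)
    # -m selects the model file; -ngl 999 requests offloading every layer to the GPU.
    argv = [binary, "-m", model_path, "-ngl", "999"]
    return env, argv


if __name__ == "__main__":
    # Hypothetical model path; substitute your own GGUF file.
    env, argv = build_launch("models/llama-70b.gguf", n_cb=4)
    print("LLAMA_METAL_N_CB=" + env["LLAMA_METAL_N_CB"], " ".join(argv))
    # To actually launch: import subprocess; subprocess.run(argv, env=env, check=True)
```

Comparing tokens/sec with and without the variable on the same prompt is the simplest way to check whether the setting has any effect on a given build.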
Journey Context:
Apple Silicon has high unified memory bandwidth (up to roughly 800 GB/s on Ultra-class chips), but llama.cpp's Metal backend defaults to a single command buffer, which serializes kernel submission. The GPU is left underutilized because the CPU cannot enqueue work fast enough to keep it busy. `LLAMA_METAL_N_CB` enables multiple command buffers, so the CPU can prepare the next kernels while the GPU executes the current ones. This matters most for 70B models, where each layer is large. Values of 4-8 are optimal; higher values increase CPU overhead. Many Mac users assume `-ngl 999` alone is enough and are unaware of this environment variable.
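The overlap argument can be illustrated with a toy timing model. This is a sketch under strong assumptions, not a measurement of llama.cpp: it assumes one CPU thread encodes the buffers in order, the work splits evenly, and each buffer costs a fixed CPU overhead. The function name `pipeline_time` and all timing numbers are invented for illustration.

```python
def pipeline_time(n_cb, encode_ms, gpu_ms, overhead_ms):
    """Toy model: time to finish one graph split evenly across n_cb command buffers.

    The CPU encodes buffers one after another; the GPU can start a buffer only
    once it is encoded and the previous buffer has finished executing.
    """
    enc_per_buf = encode_ms / n_cb + overhead_ms  # CPU cost per buffer
    gpu_per_buf = gpu_ms / n_cb                   # GPU cost per buffer
    cpu_t = 0.0  # time at which the CPU finishes encoding the current buffer
    gpu_t = 0.0  # time at which the GPU finishes the previous buffer
    for _ in range(n_cb):
        cpu_t += enc_per_buf
        gpu_t = max(cpu_t, gpu_t) + gpu_per_buf
    return gpu_t


if __name__ == "__main__":
    # Arbitrary illustrative costs in milliseconds, not measured values.
    for n in (1, 2, 4, 8, 16):
        print(n, round(pipeline_time(n, encode_ms=4.0, gpu_ms=8.0, overhead_ms=0.5), 2))
```

In this model a single buffer gives no overlap, a handful of buffers hides most of the encoding time behind GPU execution, and a large count makes the per-buffer overhead dominate, which matches the shape of the claim above without confirming the specific values.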
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:08:23.189143+00:00— report_created — created