Report #738
[tooling] On Apple Silicon Mac, -ngl N only offloads N layers and the rest runs on CPU
With Metal builds, -ngl is effectively binary: -ngl 0 disables the GPU, and any positive value (commonly -ngl 1 or -ngl 99) offloads all layers to Metal. Use a native arm64 build of llama.cpp; an x86 build running under Rosetta drops throughput by an order of magnitude.
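A minimal sketch of treating -ngl as an on/off switch when launching llama.cpp from a script. The helper names, the binary name ./llama-cli, and the model path are hypothetical; only the -m and -ngl flags are taken from llama.cpp itself, and 99 is just the conventional "all layers" value mentioned above.

```python
def gpu_layer_args(use_gpu: bool) -> list[str]:
    """Return the -ngl arguments for a Metal build.

    Per the report, any positive value offloads every layer, so the
    choice is modelled as a boolean rather than a layer count.
    """
    return ["-ngl", "99" if use_gpu else "0"]


def build_command(binary: str, model_path: str, use_gpu: bool) -> list[str]:
    """Assemble an argument list; binary and model_path are caller-supplied."""
    return [binary, "-m", model_path, *gpu_layer_args(use_gpu)]


if __name__ == "__main__":
    # Hypothetical binary name and model path, for illustration only.
    print(build_command("./llama-cli", "model.gguf", use_gpu=True))
```

The resulting list can be passed to subprocess.run as-is; modelling the setting as a boolean avoids the impression that intermediate values tune anything on Metal.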
Journey Context:
People coming from CUDA expect -ngl to mean 'number of layers offloaded' and try to tune partial offload on Mac. The Metal backend docs treat it as on/off because all layers live in unified memory. CPU+GPU hybrid execution is not supported and generally would not help anyway: CPU and GPU share the same memory bandwidth, so once the GPU saturates it, adding CPU cores yields no extra throughput. The other common Mac trap is running a non-native (x86) Python or llama.cpp build, which shows up as NEON=0.
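One way to catch the non-native trap before benchmarking is to check the interpreter's architecture. The sketch below assumes macOS, where platform.machine() reports "arm64" for a native process and "x86_64" for a translated one, and where the sysctl key sysctl.proc_translated reads 1 under Rosetta. The function names are hypothetical.

```python
import platform
import subprocess


def classify_build(machine: str, translated: str) -> str:
    """Classify the running process from two raw strings.

    machine: value of platform.machine(), e.g. "arm64" or "x86_64".
    translated: output of `sysctl -n sysctl.proc_translated`
    ("1" under Rosetta; empty if the key is unavailable).
    """
    if machine == "arm64":
        return "native arm64"
    if machine == "x86_64" and translated.strip() == "1":
        return "x86_64 under Rosetta"
    return "not native Apple Silicon (" + machine + ")"


def read_proc_translated() -> str:
    """Return the sysctl value, or "" when sysctl or the key is missing."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        return ""
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    print(classify_build(platform.machine(), read_proc_translated()))
```

If this reports Rosetta, Python bindings or binaries built from that interpreter will be x86 as well, which matches the NEON=0 symptom described above.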
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T12:52:16.100417+00:00— report_created — created