Report #10480
[tooling] llama.cpp on Apple Silicon: partial GPU offloading (e.g. -ngl 20) can be slower than CPU-only inference
On Apple Silicon, use either -ngl 999 (offload all layers to the GPU) or -ngl 0 (CPU only), and avoid intermediate -ngl values. If the model does not fit in unified memory, quantize further or use a smaller model rather than offloading only some layers.
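The all-or-nothing rule above can be sketched as a small helper. This is a minimal illustration, not part of llama.cpp: the function names, the 20% headroom for the KV cache and compute buffers, and the memory budget are all assumptions to adjust for your machine. Only the llama-cli binary name and its -m and -ngl flags come from llama.cpp itself.

```python
import shlex

GIB = 1024 ** 3


def choose_ngl(model_bytes, memory_budget_bytes, headroom=0.2):
    """Return 999 (all layers on GPU) if the model fits, else 0 (CPU only).

    `headroom` is a hypothetical safety margin for the KV cache and
    compute buffers; it is an assumption, not a llama.cpp constant.
    """
    required = int(model_bytes * (1.0 + headroom))
    return 999 if required <= memory_budget_bytes else 0


def build_command(model_path, ngl, binary="llama-cli"):
    """Build the argument list for a llama.cpp run (the command is not executed)."""
    return [binary, "-m", model_path, "-ngl", str(ngl)]


# Example: a 4 GiB model file against a 16 GiB memory budget.
ngl = choose_ngl(4 * GIB, 16 * GIB)
print(shlex.join(build_command("model.gguf", ngl)))
```

The helper never returns an intermediate value, which mirrors the recommendation: if the model does not fit, it falls back to CPU only instead of splitting layers.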
Journey Context:
Apple Silicon uses unified memory, but llama.cpp's Metal backend has historically copied tensors between CPU and GPU buffers when only some layers are offloaded (-ngl set to a value between 0 and the total layer count), which causes severe synchronization overhead. Users tend to assume that -ngl 20 (20 layers on the GPU) is faster than CPU-only inference for large models, but the repeated buffer copies and CPU/GPU synchronization on every token become the bottleneck, even though the memory is unified. The two well-optimized paths are "everything on the GPU" (no copies) and "everything on the CPU" (no Metal overhead). This behavior has been reported as a performance regression in llama.cpp GitHub issues about Metal partial offloading. The workaround is binary, all or nothing: on Apple Silicon, choose -ngl 999 if the model fits in unified memory; otherwise use -ngl 0 and run on the CPU backend (Accelerate and ARM NEON kernels), which avoids the partial-offload penalty.
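Because the penalty depends on the llama.cpp version and the model, it is worth measuring before relying on this report. The sketch below only builds a llama-bench command that compares CPU-only, partial, and full offload in one run. It assumes your llama-bench build accepts comma-separated -ngl values, as the llama.cpp README describes; check llama-bench --help to confirm. The function name and the model path are hypothetical.

```python
import shlex


def build_bench_command(model_path, ngl_values=(0, 20, 999), binary="llama-bench"):
    """Build a llama-bench argument list that sweeps several -ngl settings.

    Assumption: llama-bench takes a comma-separated list for -ngl.
    The command is only constructed here, not executed.
    """
    ngl_arg = ",".join(str(v) for v in ngl_values)
    return [binary, "-m", model_path, "-ngl", ngl_arg]


# Print the command so it can be reviewed and pasted into a terminal.
print(shlex.join(build_bench_command("model.gguf")))
```

If the tokens-per-second figure for the intermediate value is not lower than the CPU-only figure on your build, the regression described here may no longer apply to your setup.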
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-16T10:48:19.633622+00:00— report_created — created