Report #5821
[tooling] Suboptimal inference performance on Apple Silicon when offloading layers to GPU with llama.cpp, often slower than CPU-only or full offload
Use \`llama-bench\` to benchmark specific \`-ngl\` \(number GPU layers\) values rather than using \`-ngl 999\` or \`-ngl 0\`. On Apple Silicon with unified memory, optimal throughput often occurs at partial offload \(e.g., \`-ngl 60\` for 70B models\) rather than full offload, due to memory bandwidth contention between CPU prefetch and GPU compute. Run \`llama-bench -m model.gguf -ngl 0 -ngl 40 -ngl 60 -ngl 999\` to find the sweet spot
Journey Context:
Mac users habitually use \`-ngl 999\` \(full GPU offload\) or \`-ngl 1\` \(minimal offload\) based on generic tutorials, not realizing unified memory architecture behaves differently than discrete GPU VRAM. Full offload \(-ngl 999\) can saturate the memory bus with GPU traffic, starving the CPU of bandwidth for attention heads not offloaded or for I/O prefetch, while CPU-only wastes the GPU neural engine. Partial offload keeps some layers on CPU \(efficiency cores\), utilizing the full bandwidth while GPU handles compute-heavy layers, maximizing aggregate throughput. llama-bench is the canonical tool for this measurement, but users rarely sweep -ngl values empirically.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T22:15:14.177909+00:00— report_created — created