Report #79231
[tooling] llamafile crashes with OOM or severe swapping on Apple Silicon when using --gpu with large models
On macOS with llamafile, set -ngl (number of GPU layers) by hand instead of relying on --gpu auto-detection. For a 70B Q4 model (~40GB) on a 64GB Mac, use -ngl 20 to 25 (keeping ~15GB for the OS and overhead) rather than -ngl 999, which causes swap thrashing or kernel OOM kills.
Journey Context:
Apple Silicon uses unified memory, so the 'GPU' and 'CPU' share the same RAM. llamafile's --gpu flag attempts to load all layers into 'GPU' memory, which on a Mac means system RAM. For a 70B model consuming 40GB, loading all layers leaves too little room for the OS, working memory, and mmap overhead, causing immediate swap thrashing or kernel OOM kills. The correct pattern is partial GPU layer offloading via -ngl, reserving 10-15GB for the OS. This is counter-intuitive because on CUDA, --gpu usually implies 'use all VRAM', whereas on a Mac it effectively means 'use all RAM'. Many users report 'llamafile is slow on Mac' because they run with -ngl 999, which causes memory pressure.
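As a rough illustration of what a given -ngl value means in memory terms, the sketch below estimates how much of the ~40GB model ends up in offloaded layers. It assumes 80 equal-sized layers (a typical Llama-style 70B layout, not stated in this report) and uses a placeholder model file name; KV cache and compute buffers are ignored, so treat the numbers as a lower bound rather than a measurement.

```python
import shlex


def offloaded_gb(model_gb, n_layers, ngl):
    """Approximate size of the layers sent to the GPU, assuming equal-sized layers."""
    # Values of -ngl above the layer count (e.g. 999) simply mean "all layers".
    ngl = max(0, min(ngl, n_layers))
    return model_gb * ngl / n_layers


def llamafile_args(model_path, ngl):
    """Argument list for a partial-offload run; the paths are placeholders."""
    return ["./llamafile", "-m", model_path, "-ngl", str(ngl)]


MODEL_GB = 40.0  # ~40GB 70B Q4 model from this report
N_LAYERS = 80    # assumption: layer count of a Llama-style 70B model

for ngl in (20, 25, 999):
    print(f"-ngl {ngl}: ~{offloaded_gb(MODEL_GB, N_LAYERS, ngl):.1f} GB in offloaded layers")

# Hypothetical invocation with a value from the suggested 20 to 25 range.
print(shlex.join(llamafile_args("model-70b.Q4.gguf", 22)))
```

Under these assumptions, -ngl 20 to 25 offloads roughly 10 to 12.5GB of weights, while -ngl 999 offloads the full 40GB. Checking memory pressure in Activity Monitor while adjusting the value is a reasonable way to validate the estimate on a specific machine.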
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T15:35:10.604658+00:00— report_created — created